R3 Q8: Heterogeneity Analysis - Continued

Continuation of Heterogeneity Discussion

This notebook runs the complete pathway analysis to demonstrate biological heterogeneity in myocardial infarction.

Key Points:

  • Runs deviation-from-reference clustering to identify distinct pathways to MI
  • Generates pathway visualizations and statistics
  • Demonstrates that "MI" is not a single entity but arises from different biological pathways

Note on Methods: This analysis uses deviation-from-reference clustering for pathway discovery, which differs from the main paper's approach (time-averaged signature loadings). This is an exploratory analysis to demonstrate heterogeneity concepts.
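
To make the feature construction concrete, here is a minimal sketch of deviation-from-reference features. It is an illustration under stated assumptions, not the code in `run_deviation_only_analysis`: the function name `deviation_features`, the toy data, and the choice of one timepoint per year are hypothetical, and the population reference is assumed to be the mean signature loading across all patients. With 21 signatures and a 10-year lookback this yields the 210 features per patient that the run log reports.

```python
import numpy as np

def deviation_features(thetas, event_idx, lookback=10):
    """Build deviation-from-reference features (illustrative sketch).

    thetas:    array of shape (N, K, T) with signature loadings.
    event_idx: dict mapping patient row -> timepoint index of disease onset.
    Returns (features, kept_rows); features has shape (n_kept, K * lookback).
    """
    # Assumed reference: population mean loading per signature and timepoint.
    reference = thetas.mean(axis=0)  # shape (K, T)
    rows, feats = [], []
    for i, t in sorted(event_idx.items()):
        if t < lookback:
            # Not enough pre-disease history; drop the patient.
            continue
        window = slice(t - lookback, t)
        deviation = thetas[i, :, window] - reference[:, window]
        feats.append(deviation.ravel())
        rows.append(i)
    n_signatures = thetas.shape[1]
    features = np.array(feats).reshape(len(rows), n_signatures * lookback)
    return features, rows

# Toy data: 50 patients, 21 signatures, 52 timepoints, loadings sum to 1 over signatures.
rng = np.random.default_rng(0)
toy = rng.dirichlet(np.ones(21), size=(50, 52)).transpose(0, 2, 1)
onsets = {0: 30, 1: 5, 2: 12}  # patient 1 has fewer than 10 timepoints of history
features, kept = deviation_features(toy, onsets, lookback=10)
print(features.shape, kept)  # (2, 210) [0, 2]
```

The resulting feature matrix would then be passed to a clustering algorithm with `n_pathways` clusters; which algorithm the helper uses is not stated here, so that step is left out of the sketch.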

Setup and Parameters

Configure the pathway analysis parameters:

In [ ]:
# ============================================================================
# SETUP: Import and Configure
# ============================================================================
import sys
import os
%load_ext autoreload
# Reload edited helper modules automatically before each cell runs
%autoreload 2
sys.path.append('/Users/sarahurbut/aladynoulli2/pyScripts/new_oct_revision')

from helper_py.run_complete_pathway_analysis_deviation_only import run_deviation_only_analysis
from helper_py.run_transition_analysis_ukb_mgb import run_transition_analysis_both_cohorts
from helper_py.analyze_sig5_by_pathway import analyze_signature5_by_pathway
from helper_py.show_pathway_reproducibility import main as show_reproducibility

# Analysis parameters
target_disease = "myocardial infarction"
transition_disease = "Rheumatoid arthritis"
n_pathways = 4
lookback_years = 10
output_dir = 'complete_pathway_analysis_output'
mgb_model_path = '/Users/sarahurbut/Dropbox-Personal/model_with_kappa_bigam_MGB.pt'

print("="*80)
print("PATHWAY ANALYSIS CONFIGURATION")
print("="*80)
print(f"Target disease: {target_disease}")
print(f"Transition disease: {transition_disease}")
print(f"Number of pathways: {n_pathways}")
print(f"Lookback years: {lookback_years}")
print(f"Output directory: {output_dir}")
print("="*80)
================================================================================
PATHWAY ANALYSIS CONFIGURATION
================================================================================
Target disease: myocardial infarction
Transition disease: Rheumatoid arthritis
Number of pathways: 4
Lookback years: 10
Output directory: complete_pathway_analysis_output
================================================================================

Verify Data Loading

Check that we're loading the correct discovery thetas:

In [2]:
# ============================================================================
# VERIFY THETAS LOADING
# ============================================================================
# Check which thetas file is used for pathway discovery
import torch
from pathlib import Path

# The discovery thetas are loaded by load_full_data() in pathway_discovery.py
# They come from: /Users/sarahurbut/aladynoulli2/pyScripts/pt/new_thetas_with_pcs_retrospective.pt
#
# These thetas are created by assemble_new_model_with_pcs():
# - Loads lambda_ from all batch model files (0-400K patients in 10K batches)
# - Concatenates all batches
# - Applies softmax to convert lambdas → thetas
# - Saves to new_thetas_with_pcs_retrospective.pt

thetas_path = Path('/Users/sarahurbut/aladynoulli2/pyScripts/pt/new_thetas_with_pcs_retrospective.pt')

print("="*80)
print("VERIFYING DISCOVERY THETAS")
print("="*80)

if thetas_path.exists():
    print(f"\n✅ Discovery thetas file found:")
    print(f"   {thetas_path}")
    
    # Load and check shape
    thetas = torch.load(thetas_path, map_location='cpu')
    if torch.is_tensor(thetas):
        # detach() guards against a tensor that was saved with requires_grad=True
        thetas = thetas.detach().cpu().numpy()
    
    print(f"\n   Thetas shape: {thetas.shape}")
    print(f"   - N (patients): {thetas.shape[0]:,}")
    print(f"   - K (signatures): {thetas.shape[1]}")
    print(f"   - T (timepoints): {thetas.shape[2]}")
    print(f"\n   Range: [{thetas.min():.4f}, {thetas.max():.4f}]")
    print(f"   Mean: {thetas.mean():.4f}")
    print(f"   Std: {thetas.std():.4f}")
    
    print(f"\n💡 Source: These thetas are assembled from all batch model files")
    print(f"   (0-400K patients) by applying softmax to lambda_ values")
    print(f"   They will be loaded automatically by run_deviation_only_analysis()")
else:
    print(f"\n⚠️  Discovery thetas file not found:")
    print(f"   {thetas_path}")
    print(f"\n   The analysis will fail if this file doesn't exist!")
    print(f"   Run assemble_new_model_with_pcs() to create this file")

print("="*80)
================================================================================
VERIFYING DISCOVERY THETAS
================================================================================

✅ Discovery thetas file found:
   /Users/sarahurbut/aladynoulli2/pyScripts/pt/new_thetas_with_pcs_retrospective.pt

   Thetas shape: (400000, 21, 52)
   - N (patients): 400,000
   - K (signatures): 21
   - T (timepoints): 52

   Range: [0.0000, 0.9966]
   Mean: 0.0476
   Std: 0.0549

💡 Source: These thetas are assembled from all batch model files
   (0-400K patients) by applying softmax to lambda_ values
   They will be loaded automatically by run_deviation_only_analysis()
================================================================================
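
The comments in the cell above say the thetas are obtained by applying a softmax to `lambda_`. A minimal sketch of that conversion follows, assuming the softmax is taken over the signature axis (axis 1 of an `(N, K, T)` array); the function name and toy data are hypothetical. That assumption is consistent with the reported mean of 0.0476, which equals 1/21: if each patient's loadings sum to 1 across the 21 signatures at every timepoint, the overall mean must be 1/K.

```python
import numpy as np

def softmax_over_signatures(lambdas):
    """Convert lambdas of shape (N, K, T) to thetas that sum to 1 over K."""
    # Subtract the per-(patient, timepoint) maximum for numerical stability.
    shifted = lambdas - lambdas.max(axis=1, keepdims=True)
    exponentiated = np.exp(shifted)
    return exponentiated / exponentiated.sum(axis=1, keepdims=True)

# Toy lambdas: 5 patients, 21 signatures, 52 timepoints.
rng = np.random.default_rng(0)
toy_lambdas = rng.normal(size=(5, 21, 52))
toy_thetas = softmax_over_signatures(toy_lambdas)

print(toy_thetas.shape)                          # (5, 21, 52)
print(np.allclose(toy_thetas.sum(axis=1), 1.0))  # True
print(round(float(toy_thetas.mean()), 4))        # 0.0476, i.e. 1/21
```

Checking that `thetas.sum(axis=1)` is close to 1 on the loaded file is a quick way to confirm that the saved array holds thetas rather than raw lambdas.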

STEP 1: Run Pathway Discovery (UKB)

Execute the complete pathway analysis pipeline:

In [3]:
# ============================================================================
# STEP 1: DEVIATION-BASED PATHWAY DISCOVERY (UKB)
# ============================================================================
# This will:
# 1. Load full dataset (Y, thetas, disease_names, processed_ids)
#    - Thetas loaded from: new_thetas_with_pcs_retrospective.pt (discovery thetas)
# 2. Discover pathways using deviation-from-reference clustering
# 3. Interrogate pathways (signatures, diseases, medications, PRS)
# 4. Generate all visualizations
# 5. Save results to output directory

os.makedirs(output_dir, exist_ok=True)
results = {}

print("="*80)
print("STEP 1: PATHWAY DISCOVERY (UKB)")
print("="*80)


# Run the complete analysis
results['ukb'] = run_deviation_only_analysis(
    target_disease=target_disease,
    n_pathways=n_pathways,
    output_dir=os.path.join(output_dir, 'ukb_pathway_discovery'),
    lookback_years=lookback_years
)

print("\n✅ UKB pathway discovery complete")
================================================================================
STEP 1: PATHWAY DISCOVERY (UKB)
================================================================================
================================================================================
COMPLETE PATHWAY ANALYSIS: MYOCARDIAL INFARCTION
Method: Deviation-from-Reference (10-year lookback)
================================================================================

1. LOADING FULL DATASET
Loading full dataset...
Loaded Y (full): torch.Size([407878, 348, 52])
Loaded thetas: (400000, 21, 52)
Loaded 400000 processed IDs
Subset Y to first 400K patients: torch.Size([400000, 348, 52])
Loaded 348 diseases
Total patients with complete data: 400000

2. DISCOVERING PATHWAYS TO MYOCARDIAL INFARCTION
Using Deviation-from-Reference Method (10-year lookback)
=== DISCOVERING PATHWAYS TO MYOCARDIAL INFARCTION ===
Method: deviation_from_reference
Lookback years: 10
Found target disease: Myocardial infarction (index 112)
Found 24920 patients who developed myocardial infarction

Creating trajectory features for pathway discovery...
Method: deviation_from_reference

--- COMPUTING POPULATION REFERENCE FOR DEVIATION-BASED CLUSTERING ---
Computing population-level signature reference from all 400000 patients...
Population reference shape: (21, 52)
Created 210 features per patient (DEVIATION from reference)
  - 210 features: deviation per signature per timepoint (K signatures × 10 timepoints)
Kept 24803 patients with sufficient pre-disease history

Discovered 4 pathways to myocardial infarction:
  Pathway 0: 1836 patients (7.4%)
  Pathway 1: 11108 patients (44.6%)
  Pathway 2: 4439 patients (17.8%)
  Pathway 3: 7420 patients (29.8%)

3. INTERROGATING PATHWAYS
   This step includes:
   - Most discriminating signatures
   - Disease prevalence patterns
   - Signature deviation trajectories by pathway
=== INTERROGATING PATHWAYS TO MYOCARDIAL INFARCTION ===

1. PATHWAY STATISTICS:
   Pathway 0: 1836 patients (7.4%)
   Pathway 1: 11108 patients (44.8%)
   Pathway 2: 4439 patients (17.9%)
   Pathway 3: 7420 patients (29.9%)

2. CALCULATING SIGNATURE TRAJECTORIES:
   Pathway 0: 1836 patients
   Pathway 1: 11108 patients
   Pathway 2: 4439 patients
   Pathway 3: 7420 patients

3. MOST DISCRIMINATING SIGNATURES:
   Top 5 discriminating signatures:
     1. Signature 5: Score = 0.6117
     2. Signature 20: Score = 0.2352
     3. Signature 12: Score = 0.1457
     4. Signature 1: Score = 0.1209
     5. Signature 18: Score = 0.1171

4. DISEASE PATTERNS BY PATHWAY (PRE-TARGET DISEASE):
   Pathway 0 top PRE-disease conditions:
     1. Coronary atherosclerosis: 1584 patients (86.3%)
     2. Hypercholesterolemia: 1393 patients (75.9%)
     3. Angina pectoris: 1377 patients (75.0%)
     4. Essential hypertension: 1374 patients (74.8%)
     5. Other chronic ischemic heart disease, unspecified: 1371 patients (74.7%)
     6. Unstable angina (intermediate coronary syndrome): 647 patients (35.2%)
     7. Type 2 diabetes: 481 patients (26.2%)
     8. Arthropathy NOS: 330 patients (18.0%)
     9. Atrial fibrillation and flutter: 324 patients (17.6%)
     10. Hyperlipidemia: 267 patients (14.5%)
   Pathway 1 top PRE-disease conditions:
     1. Essential hypertension: 2308 patients (20.8%)
     2. Hypercholesterolemia: 1105 patients (9.9%)
     3. Other chronic ischemic heart disease, unspecified: 982 patients (8.8%)
     4. Coronary atherosclerosis: 927 patients (8.3%)
     5. Arthropathy NOS: 891 patients (8.0%)
     6. Angina pectoris: 823 patients (7.4%)
     7. Type 2 diabetes: 759 patients (6.8%)
     8. Diverticulosis: 708 patients (6.4%)
     9. Cataract: 604 patients (5.4%)
     10. Diaphragmatic hernia: 571 patients (5.1%)
   Pathway 2 top PRE-disease conditions:
     1. Essential hypertension: 2896 patients (65.2%)
     2. Arthropathy NOS: 1570 patients (35.4%)
     3. Hypercholesterolemia: 1320 patients (29.7%)
     4. Diverticulosis: 1166 patients (26.3%)
     5. Diaphragmatic hernia: 1156 patients (26.0%)
     6. Other chronic ischemic heart disease, unspecified: 1048 patients (23.6%)
     7. Angina pectoris: 1014 patients (22.8%)
     8. Asthma: 992 patients (22.3%)
     9. Coronary atherosclerosis: 902 patients (20.3%)
     10. Benign neoplasm of colon: 897 patients (20.2%)
   Pathway 3 top PRE-disease conditions:
     1. Essential hypertension: 2066 patients (27.8%)
     2. Hypercholesterolemia: 1245 patients (16.8%)
     3. Coronary atherosclerosis: 1183 patients (15.9%)
     4. Type 2 diabetes: 989 patients (13.3%)
     5. Other chronic ischemic heart disease, unspecified: 935 patients (12.6%)
     6. Angina pectoris: 915 patients (12.3%)
     7. Arthropathy NOS: 559 patients (7.5%)
     8. Asthma: 445 patients (6.0%)
     9. Atrial fibrillation and flutter: 435 patients (5.9%)
     10. Inguinal hernia: 412 patients (5.6%)

4b. DISEASES THAT DIFFERENTIATE PATHWAYS:
   Top 15 diseases that differentiate pathways (by variance in prevalence):
     1. Coronary atherosclerosis
        Pathway 0: 1584 patients (86.3%)
        Pathway 1: 927 patients (8.3%)
        Pathway 2: 902 patients (20.3%)
        Pathway 3: 1183 patients (15.9%)
     2. Angina pectoris
        Pathway 0: 1377 patients (75.0%)
        Pathway 1: 823 patients (7.4%)
        Pathway 2: 1014 patients (22.8%)
        Pathway 3: 915 patients (12.3%)
     3. Other chronic ischemic heart disease, unspecified
        Pathway 0: 1371 patients (74.7%)
        Pathway 1: 982 patients (8.8%)
        Pathway 2: 1048 patients (23.6%)
        Pathway 3: 935 patients (12.6%)
     4. Hypercholesterolemia
        Pathway 0: 1393 patients (75.9%)
        Pathway 1: 1105 patients (9.9%)
        Pathway 2: 1320 patients (29.7%)
        Pathway 3: 1245 patients (16.8%)
     5. Essential hypertension
        Pathway 0: 1374 patients (74.8%)
        Pathway 1: 2308 patients (20.8%)
        Pathway 2: 2896 patients (65.2%)
        Pathway 3: 2066 patients (27.8%)
     6. Unstable angina (intermediate coronary syndrome)
        Pathway 0: 647 patients (35.2%)
        Pathway 1: 309 patients (2.8%)
        Pathway 2: 323 patients (7.3%)
        Pathway 3: 360 patients (4.9%)
     7. Arthropathy NOS
        Pathway 0: 330 patients (18.0%)
        Pathway 1: 891 patients (8.0%)
        Pathway 2: 1570 patients (35.4%)
        Pathway 3: 559 patients (7.5%)
     8. Diaphragmatic hernia
        Pathway 0: 240 patients (13.1%)
        Pathway 1: 571 patients (5.1%)
        Pathway 2: 1156 patients (26.0%)
        Pathway 3: 399 patients (5.4%)
     9. Diverticulosis
        Pathway 0: 223 patients (12.1%)
        Pathway 1: 708 patients (6.4%)
        Pathway 2: 1166 patients (26.3%)
        Pathway 3: 351 patients (4.7%)
     10. Type 2 diabetes
        Pathway 0: 481 patients (26.2%)
        Pathway 1: 759 patients (6.8%)
        Pathway 2: 825 patients (18.6%)
        Pathway 3: 989 patients (13.3%)
     11. Asthma
        Pathway 0: 197 patients (10.7%)
        Pathway 1: 506 patients (4.6%)
        Pathway 2: 992 patients (22.3%)
        Pathway 3: 445 patients (6.0%)
     12. Benign neoplasm of colon
        Pathway 0: 174 patients (9.5%)
        Pathway 1: 487 patients (4.4%)
        Pathway 2: 897 patients (20.2%)
        Pathway 3: 301 patients (4.1%)
     13. GERD
        Pathway 0: 159 patients (8.7%)
        Pathway 1: 313 patients (2.8%)
        Pathway 2: 833 patients (18.8%)
        Pathway 3: 251 patients (3.4%)
     14. Gastritis and duodenitis
        Pathway 0: 165 patients (9.0%)
        Pathway 1: 370 patients (3.3%)
        Pathway 2: 784 patients (17.7%)
        Pathway 3: 304 patients (4.1%)
     15. Obesity
        Pathway 0: 216 patients (11.8%)
        Pathway 1: 243 patients (2.2%)
        Pathway 2: 703 patients (15.8%)
        Pathway 3: 303 patients (4.1%)

5. CREATING PATHWAY VISUALIZATIONS:
   Saved plot: complete_pathway_analysis_output/ukb_pathway_discovery/top_discriminating_signatures.pdf
   Saved plot: complete_pathway_analysis_output/ukb_pathway_discovery/pathway_size_and_age.pdf
6. CREATING STACKED SIGNATURE DEVIATION PLOTS:
   Saved plot: complete_pathway_analysis_output/ukb_pathway_discovery/signature_deviations_by_pathway.pdf
Summary of signature deviations (5 years before disease):
  Pathway 0: Total absolute deviation = 0.370
    Top 3 signatures: [(5, 0.17910814), (16, -0.041025706), (7, -0.019809388)]
  Pathway 1: Total absolute deviation = 0.112
    Top 3 signatures: [(8, 0.02484078), (16, -0.017946318), (3, 0.0149342865)]
  Pathway 2: Total absolute deviation = 0.223
    Top 3 signatures: [(16, -0.032341897), (7, 0.031873554), (10, -0.019288678)]
  Pathway 3: Total absolute deviation = 0.615
    Top 3 signatures: [(5, 0.22926128), (8, -0.118373916), (3, -0.073842525)]

3b. CREATING SIGNATURE DEVIATION PLOTS
   Saved stacked deviation plot: complete_pathway_analysis_output/ukb_pathway_discovery/signature_deviations_myocardial_infarction_10yr_stacked.pdf
   Saved line deviation plot: complete_pathway_analysis_output/ukb_pathway_discovery/signature_deviations_myocardial_infarction_10yr_line.pdf

4. ANALYZING MEDICATION DIFFERENCES BY PATHWAY
=== INTEGRATING LONG-TERM MEDICATIONS WITH SIGNATURE PATHWAYS ===
Loading medication data from /Users/sarahurbut/Library/CloudStorage/Dropbox-Personal/gp_scripts.txt...
✅ Loaded 56,212,343 prescription records
   From 222,044 unique patients
   Covering 20,937 unique medications
Analyzing long-term medication patterns...
=== ANALYZING PRESCRIPTION DURATION PATTERNS ===
Looking for medications with ≥5 prescriptions over ≥5 years
Valid prescription dates: 56,206,252 / 56,212,343
Available columns: ['eid', 'data_provider', 'issue_date', 'read_2', 'bnf_code', 'dmd_code', 'drug_name', 'quantity']
✅ Found 'drug_name' column with 80678 unique drugs
✅ Found 602,144 long-term medication patterns
   Involving 119,773 unique patients
   Across 16,244 unique medications

=== IDENTIFYING SYSTEMATIC LONG-TERM DRUGS ===
Criteria: ≥100 patients, ≥3 years average duration
✅ Found 751 systematic long-term drugs

=== DRUG CATEGORY ANALYSIS ===
Top therapeutic categories for long-term medications:
bnf_category  category_name                       total_patients  avg_duration  n_drugs
02            Cardiovascular system                       147384          9.17      151
04            Central nervous system                       65196          9.83       96
06            Endocrine system                             51616          8.48       82
01            Gastro-intestinal system                     46857          9.28       58
03            Respiratory system                           38491          8.71       64
05            Infections                                   35923          9.89       37
10            Musculoskeletal and joint diseases           24786          9.64       43
13            Skin                                         22124          9.98       63
12            Ear, nose and oropharynx                     10862          9.82       16
09            Nutrition and blood                           9726          8.71       20
Pathway patient IDs range: 9 to 399997
Sample pathway patient IDs: [9, 87, 90, 93, 146, 151, 153, 158, 169, 191]
Medication eid range: 1000015 to 6026612
Sample medication eids: [1000015, 1000037, 1000086, 1000113, 1000134, 1000169, 1000198, 1000201, 1000210, 1000232]
Using processed IDs for mapping (have 400000 processed IDs)
Successfully mapped 24803 pathway patients to eids
Sample mappings: {9: 1000113, 87: 1001140, 90: 1001175, 93: 1001215, 146: 1001881}
Found 6793368 medication records for pathway patients
Unique patients with medication data: 11066

Analyzing long-term medication patterns for 24803 patients with myocardial infarction
Pathway 0: 1836 patients, 636 with long-term meds
  Long-term medications: 556 unique drugs
  Average medication diversity: 9.42
  Top 3 long-term meds: ['Aspirin 75mg dispersible tablets', 'Simvastatin 40mg tablets', 'Omeprazole 20mg gastro-resistant capsules']
  Top 3 BNF categories: ['Cardiovascular system', 'Central nervous system', 'Gastro-intestinal system']
  Systematic drugs in pathway: 556
    • Simvastatin 40mg tablets: 14081 patients, 8.2 years avg
    • Omeprazole 20mg gastro-resistant capsules: 12523 patients, 9.9 years avg
    • Bendroflumethiazide 2.5mg tablets: 12249 patients, 11.3 years avg
Pathway 1: 11108 patients, 3343 with long-term meds
  Long-term medications: 737 unique drugs
  Average medication diversity: 6.52
  Top 3 long-term meds: ['Aspirin 75mg dispersible tablets', 'Simvastatin 40mg tablets', 'Amoxicillin 500mg capsules']
  Top 3 BNF categories: ['Cardiovascular system', 'Central nervous system', 'Gastro-intestinal system']
  Systematic drugs in pathway: 737
    • Simvastatin 40mg tablets: 14081 patients, 8.2 years avg
    • Omeprazole 20mg gastro-resistant capsules: 12523 patients, 9.9 years avg
    • Bendroflumethiazide 2.5mg tablets: 12249 patients, 11.3 years avg
Pathway 2: 4439 patients, 1550 with long-term meds
  Long-term medications: 703 unique drugs
  Average medication diversity: 8.08
  Top 3 long-term meds: ['Aspirin 75mg dispersible tablets', 'Omeprazole 20mg gastro-resistant capsules', 'Paracetamol 500mg tablets']
  Top 3 BNF categories: ['Cardiovascular system', 'Central nervous system', 'Gastro-intestinal system']
  Systematic drugs in pathway: 703
    • Simvastatin 40mg tablets: 14081 patients, 8.2 years avg
    • Omeprazole 20mg gastro-resistant capsules: 12523 patients, 9.9 years avg
    • Bendroflumethiazide 2.5mg tablets: 12249 patients, 11.3 years avg
Pathway 3: 7420 patients, 2373 with long-term meds
  Long-term medications: 711 unique drugs
  Average medication diversity: 7.60
  Top 3 long-term meds: ['Aspirin 75mg dispersible tablets', 'Simvastatin 40mg tablets', 'Paracetamol 500mg tablets']
  Top 3 BNF categories: ['Cardiovascular system', 'Central nervous system', 'Endocrine system']
  Systematic drugs in pathway: 711
    • Simvastatin 40mg tablets: 14081 patients, 8.2 years avg
    • Omeprazole 20mg gastro-resistant capsules: 12523 patients, 9.9 years avg
    • Bendroflumethiazide 2.5mg tablets: 12249 patients, 11.3 years avg

=== PATHWAY-SPECIFIC MEDICATION PATTERNS ===
Analyzing 15 unique medications across 4 pathways

Pathway 0 MEDICATION PATTERNS:
  Total patients: 1836

  Top 10 differentiating medications (ranked by fold enrichment):
    1. Clopidogrel 75mg tablets:
       This pathway: 110 patients (6.0%)
       Other pathways: 0.0% average
       Fold enrichment: 5991285.40x
    2. Aspirin 75mg gastro-resistant tablets:
       This pathway: 119 patients (6.5%)
       Other pathways: 1.0% average
       Fold enrichment: 6.28x
    3. Glyceryl trinitrate 400micrograms/dose aerosol sublingual spray:
       This pathway: 122 patients (6.6%)
       Other pathways: 1.3% average
       Fold enrichment: 5.23x
    4. Atenolol 50mg tablets:
       This pathway: 125 patients (6.8%)
       Other pathways: 3.8% average
       Fold enrichment: 1.81x
    5. Aspirin 75mg dispersible tablets:
       This pathway: 346 patients (18.8%)
       Other pathways: 11.0% average
       Fold enrichment: 1.72x
    6. Omeprazole 20mg gastro-resistant capsules:
       This pathway: 135 patients (7.4%)
       Other pathways: 5.1% average
       Fold enrichment: 1.45x
    7. Simvastatin 40mg tablets:
       This pathway: 190 patients (10.3%)
       Other pathways: 7.6% average
       Fold enrichment: 1.37x
    8. Lansoprazole 30mg gastro-resistant capsules:
       This pathway: 98 patients (5.3%)
       Other pathways: 4.0% average
       Fold enrichment: 1.34x
    9. Paracetamol 500mg tablets:
       This pathway: 128 patients (7.0%)
       Other pathways: 5.2% average
       Fold enrichment: 1.34x
    10. Amoxicillin 500mg capsules:
       This pathway: 112 patients (6.1%)
       Other pathways: 5.1% average
       Fold enrichment: 1.20x

  Found 10 total differentiating medications

Pathway 1 MEDICATION PATTERNS:
  Total patients: 11108

  Top 10 differentiating medications (ranked by fold enrichment):
    1. Ramipril 10mg capsules:
       This pathway: 307 patients (2.8%)
       Other pathways: 1.4% average
       Fold enrichment: 2.00x
    2. Bendroflumethiazide 2.5mg tablets:
       This pathway: 296 patients (2.7%)
       Other pathways: 1.8% average
       Fold enrichment: 1.48x
    3. Aspirin 75mg gastro-resistant tablets:
       This pathway: 344 patients (3.1%)
       Other pathways: 2.2% average
       Fold enrichment: 1.43x
    4. Simvastatin 40mg tablets:
       This pathway: 769 patients (6.9%)
       Other pathways: 8.7% average
       Fold enrichment: 0.79x
    5. Aspirin 75mg dispersible tablets:
       This pathway: 1135 patients (10.2%)
       Other pathways: 13.8% average
       Fold enrichment: 0.74x

  Found 5 total differentiating medications

Pathway 2 MEDICATION PATTERNS:
  Total patients: 4439

  Top 10 differentiating medications (ranked by fold enrichment):
    1. Salbutamol 100micrograms/dose inhaler CFC free:
       This pathway: 183 patients (4.1%)
       Other pathways: 0.0% average
       Fold enrichment: 4122550.12x
    2. Amlodipine 5mg tablets:
       This pathway: 139 patients (3.1%)
       Other pathways: 0.0% average
       Fold enrichment: 3131335.89x
    3. Bendroflumethiazide 2.5mg tablets:
       This pathway: 240 patients (5.4%)
       Other pathways: 0.9% average
       Fold enrichment: 6.09x
    4. Omeprazole 20mg gastro-resistant capsules:
       This pathway: 330 patients (7.4%)
       Other pathways: 5.0% average
       Fold enrichment: 1.47x
    5. Paracetamol 500mg tablets:
       This pathway: 318 patients (7.2%)
       Other pathways: 5.2% average
       Fold enrichment: 1.39x
    6. Amoxicillin 500mg capsules:
       This pathway: 293 patients (6.6%)
       Other pathways: 4.9% average
       Fold enrichment: 1.35x
    7. Lansoprazole 30mg gastro-resistant capsules:
       This pathway: 236 patients (5.3%)
       Other pathways: 4.0% average
       Fold enrichment: 1.34x
    8. Simvastatin 40mg tablets:
       This pathway: 314 patients (7.1%)
       Other pathways: 8.7% average
       Fold enrichment: 0.82x
    9. Aspirin 75mg dispersible tablets:
       This pathway: 416 patients (9.4%)
       Other pathways: 14.1% average
       Fold enrichment: 0.66x

  Found 9 total differentiating medications

Pathway 3 MEDICATION PATTERNS:
  Total patients: 7420

  Top 10 differentiating medications (ranked by fold enrichment):
    1. Metformin 500mg tablets:
       This pathway: 301 patients (4.1%)
       Other pathways: 0.0% average
       Fold enrichment: 4056603.77x
    2. Ramipril 10mg capsules:
       This pathway: 308 patients (4.2%)
       Other pathways: 0.9% average
       Fold enrichment: 4.51x
    3. Glyceryl trinitrate 400micrograms/dose aerosol sublingual spray:
       This pathway: 283 patients (3.8%)
       Other pathways: 2.2% average
       Fold enrichment: 1.72x
    4. Simvastatin 40mg tablets:
       This pathway: 646 patients (8.7%)
       Other pathways: 8.1% average
       Fold enrichment: 1.07x
    5. Aspirin 75mg dispersible tablets:
       This pathway: 988 patients (13.3%)
       Other pathways: 12.8% average
       Fold enrichment: 1.04x

  Found 5 total differentiating medications

=== CREATING MEDICATION-PATHWAY VISUALIZATIONS ===
[Figure: medication patterns by pathway]
=== MEDICATION INTEGRATION SUMMARY ===
Target Disease: myocardial infarction
Total Pathways: 4
Total Patients: 24803
Patients with Medication Data: 7902

Pathway 0:
  Patients: 1836
  With meds: 636
  Coverage: 34.6%
  Medication diversity: 9.42
  Total prescriptions: 6949

Pathway 1:
  Patients: 11108
  With meds: 3343
  Coverage: 30.1%
  Medication diversity: 6.52
  Total prescriptions: 26169

Pathway 2:
  Patients: 4439
  With meds: 1550
  Coverage: 34.9%
  Medication diversity: 8.08
  Total prescriptions: 14871

Pathway 3:
  Patients: 7420
  With meds: 2373
  Coverage: 32.0%
  Medication diversity: 7.60
  Total prescriptions: 21700

5. ANALYZING PRS DIFFERENCES BY PATHWAY

=== ANALYZING POLYGENIC RISK SCORES BY PATHWAY ===
✅ Loaded PRS data: (400000, 37)
Available PRS columns: ['PatientID', 'AAM', 'AMD', 'AD', 'AST', 'AF', 'BD', 'BMI', 'CRC', 'BC', 'CVD', 'CED', 'CAD', 'CD', 'EOC', 'EBMDT', 'HBA1C_DF', 'HEIGHT', 'HDL', 'HT', 'IOP', 'ISS', 'LDL_SF', 'MEL', 'MS', 'OP', 'PD', 'POAG', 'PC', 'PSO', 'RA', 'SCZ', 'SLE', 'T1D', 'T2D', 'UC', 'VTE']
Pathway patient IDs: [9, 87, 90, 93, 146]...
Corresponding eids: [1000113, 1001140, 1001175, 1001215, 1001881]...
Found PRS data for 24803 pathway patients
Analyzing 36 PRS scores across 4 pathways

PRS DIFFERENCES BY PATHWAY:
Top 10 most discriminating PRS scores:

1. CAD (variance: 0.0826):
   Pathway 0: 0.906 ± 0.941 (n=1836)
   Pathway 1: 0.158 ± 0.963 (n=11108)
   Pathway 2: 0.450 ± 0.977 (n=4439)
   Pathway 3: 0.752 ± 0.963 (n=7420)
   ANOVA: F=719.857, p=0.0000
   Significant pairwise differences:
     Pathway 0 vs 1: p=0.0000
     Pathway 0 vs 2: p=0.0000
     Pathway 0 vs 3: p=0.0000

2. CVD (variance: 0.0738):
   Pathway 0: 0.846 ± 0.958 (n=1836)
   Pathway 1: 0.136 ± 0.955 (n=11108)
   Pathway 2: 0.432 ± 0.986 (n=4439)
   Pathway 3: 0.703 ± 0.956 (n=7420)
   ANOVA: F=657.687, p=0.0000
   Significant pairwise differences:
     Pathway 0 vs 1: p=0.0000
     Pathway 0 vs 2: p=0.0000
     Pathway 0 vs 3: p=0.0000

3. ISS (variance: 0.0298):
   Pathway 0: 0.475 ± 0.992 (n=1836)
   Pathway 1: 0.023 ± 0.963 (n=11108)
   Pathway 2: 0.338 ± 0.991 (n=4439)
   Pathway 3: 0.405 ± 0.985 (n=7420)
   ANOVA: F=301.521, p=0.0000
   Significant pairwise differences:
     Pathway 0 vs 1: p=0.0000
     Pathway 0 vs 2: p=0.0000
     Pathway 0 vs 3: p=0.0068

4. HT (variance: 0.0284):
   Pathway 0: 0.448 ± 0.990 (n=1836)
   Pathway 1: 0.010 ± 0.969 (n=11108)
   Pathway 2: 0.344 ± 0.981 (n=4439)
   Pathway 3: 0.377 ± 0.983 (n=7420)
   ANOVA: F=291.302, p=0.0000
   Significant pairwise differences:
     Pathway 0 vs 1: p=0.0000
     Pathway 0 vs 2: p=0.0001
     Pathway 0 vs 3: p=0.0059

5. T2D (variance: 0.0248):
   Pathway 0: 0.323 ± 0.988 (n=1836)
   Pathway 1: -0.047 ± 0.971 (n=11108)
   Pathway 2: 0.194 ± 0.970 (n=4439)
   Pathway 3: 0.352 ± 0.997 (n=7420)
   ANOVA: F=275.924, p=0.0000
   Significant pairwise differences:
     Pathway 0 vs 1: p=0.0000
     Pathway 0 vs 2: p=0.0000
     Pathway 1 vs 2: p=0.0000

6. LDL_SF (variance: 0.0187):
   Pathway 0: 0.352 ± 0.949 (n=1836)
   Pathway 1: -0.007 ± 0.996 (n=11108)
   Pathway 2: 0.109 ± 0.988 (n=4439)
   Pathway 3: 0.252 ± 0.965 (n=7420)
   ANOVA: F=140.867, p=0.0000
   Significant pairwise differences:
     Pathway 0 vs 1: p=0.0000
     Pathway 0 vs 2: p=0.0000
     Pathway 0 vs 3: p=0.0001

7. BMI (variance: 0.0130):
   Pathway 0: 0.228 ± 1.013 (n=1836)
   Pathway 1: -0.052 ± 1.000 (n=11108)
   Pathway 2: 0.201 ± 1.005 (n=4439)
   Pathway 3: 0.199 ± 0.988 (n=7420)
   ANOVA: F=134.814, p=0.0000
   Significant pairwise differences:
     Pathway 0 vs 1: p=0.0000
     Pathway 1 vs 2: p=0.0000
     Pathway 1 vs 3: p=0.0000

8. HEIGHT (variance: 0.0067):
   Pathway 0: -0.181 ± 1.049 (n=1836)
   Pathway 1: 0.026 ± 0.956 (n=11108)
   Pathway 2: -0.020 ± 0.930 (n=4439)
   Pathway 3: -0.126 ± 1.045 (n=7420)
   ANOVA: F=48.117, p=0.0000
   Significant pairwise differences:
     Pathway 0 vs 1: p=0.0000
     Pathway 0 vs 2: p=0.0000
     Pathway 0 vs 3: p=0.0419

9. AST (variance: 0.0053):
   Pathway 0: 0.032 ± 0.985 (n=1836)
   Pathway 1: -0.025 ± 0.985 (n=11108)
   Pathway 2: 0.174 ± 0.999 (n=4439)
   Pathway 3: 0.054 ± 1.002 (n=7420)
   ANOVA: F=43.572, p=0.0000
   Significant pairwise differences:
     Pathway 0 vs 1: p=0.0202
     Pathway 0 vs 2: p=0.0000
     Pathway 1 vs 2: p=0.0000

10. HDL (variance: 0.0046):
   Pathway 0: -0.181 ± 0.965 (n=1836)
   Pathway 1: -0.015 ± 0.991 (n=11108)
   Pathway 2: -0.106 ± 0.989 (n=4439)
   Pathway 3: -0.180 ± 0.993 (n=7420)
   ANOVA: F=46.804, p=0.0000
   Significant pairwise differences:
     Pathway 0 vs 1: p=0.0000
     Pathway 0 vs 2: p=0.0063
     Pathway 1 vs 2: p=0.0000
   Saved plot: complete_pathway_analysis_output/ukb_pathway_discovery/prs_by_pathway.pdf
6. ANALYZING GRANULAR DISEASE PATTERNS

=== ANALYZING GRANULAR DISEASE PATTERNS BY PATHWAY ===
Including diseases with ≥1.0% prevalence in at least one pathway

DISEASES THAT DIFFERENTIATE PATHWAYS (including rare diseases):
Found 236 diseases with sufficient prevalence
Top 20 diseases that differentiate pathways (including rare diseases):

1. Coronary atherosclerosis (max prevalence: 86.3%):
   Pathway 0: 1584 patients (86.3%)
   Pathway 1: 927 patients (8.3%)
   Pathway 2: 902 patients (20.3%)
   Pathway 3: 1183 patients (15.9%)

2. Angina pectoris (max prevalence: 75.0%):
   Pathway 0: 1377 patients (75.0%)
   Pathway 1: 823 patients (7.4%)
   Pathway 2: 1014 patients (22.8%)
   Pathway 3: 915 patients (12.3%)

3. Other chronic ischemic heart disease, unspecified (max prevalence: 74.7%):
   Pathway 0: 1371 patients (74.7%)
   Pathway 1: 982 patients (8.8%)
   Pathway 2: 1048 patients (23.6%)
   Pathway 3: 935 patients (12.6%)

4. Hypercholesterolemia (max prevalence: 75.9%):
   Pathway 0: 1393 patients (75.9%)
   Pathway 1: 1105 patients (9.9%)
   Pathway 2: 1320 patients (29.7%)
   Pathway 3: 1245 patients (16.8%)

5. Essential hypertension (max prevalence: 74.8%):
   Pathway 0: 1374 patients (74.8%)
   Pathway 1: 2308 patients (20.8%)
   Pathway 2: 2896 patients (65.2%)
   Pathway 3: 2066 patients (27.8%)

6. Unstable angina (intermediate coronary syndrome) (max prevalence: 35.2%):
   Pathway 0: 647 patients (35.2%)
   Pathway 1: 309 patients (2.8%)
   Pathway 2: 323 patients (7.3%)
   Pathway 3: 360 patients (4.9%)

7. Arthropathy NOS (max prevalence: 35.4%):
   Pathway 0: 330 patients (18.0%)
   Pathway 1: 891 patients (8.0%)
   Pathway 2: 1570 patients (35.4%)
   Pathway 3: 559 patients (7.5%)

8. Diaphragmatic hernia (max prevalence: 26.0%):
   Pathway 0: 240 patients (13.1%)
   Pathway 1: 571 patients (5.1%)
   Pathway 2: 1156 patients (26.0%)
   Pathway 3: 399 patients (5.4%)

9. Diverticulosis (max prevalence: 26.3%):
   Pathway 0: 223 patients (12.1%)
   Pathway 1: 708 patients (6.4%)
   Pathway 2: 1166 patients (26.3%)
   Pathway 3: 351 patients (4.7%)

10. Type 2 diabetes (max prevalence: 26.2%):
   Pathway 0: 481 patients (26.2%)
   Pathway 1: 759 patients (6.8%)
   Pathway 2: 825 patients (18.6%)
   Pathway 3: 989 patients (13.3%)

11. Asthma (max prevalence: 22.3%):
   Pathway 0: 197 patients (10.7%)
   Pathway 1: 506 patients (4.6%)
   Pathway 2: 992 patients (22.3%)
   Pathway 3: 445 patients (6.0%)

12. Benign neoplasm of colon (max prevalence: 20.2%):
   Pathway 0: 174 patients (9.5%)
   Pathway 1: 487 patients (4.4%)
   Pathway 2: 897 patients (20.2%)
   Pathway 3: 301 patients (4.1%)

13. GERD (max prevalence: 18.8%):
   Pathway 0: 159 patients (8.7%)
   Pathway 1: 313 patients (2.8%)
   Pathway 2: 833 patients (18.8%)
   Pathway 3: 251 patients (3.4%)

14. Gastritis and duodenitis (max prevalence: 17.7%):
   Pathway 0: 165 patients (9.0%)
   Pathway 1: 370 patients (3.3%)
   Pathway 2: 784 patients (17.7%)
   Pathway 3: 304 patients (4.1%)

15. Obesity (max prevalence: 15.8%):
   Pathway 0: 216 patients (11.8%)
   Pathway 1: 243 patients (2.2%)
   Pathway 2: 703 patients (15.8%)
   Pathway 3: 303 patients (4.1%)

16. Hemorrhoids (max prevalence: 17.4%):
   Pathway 0: 173 patients (9.4%)
   Pathway 1: 433 patients (3.9%)
   Pathway 2: 772 patients (17.4%)
   Pathway 3: 330 patients (4.4%)

17. Hyperlipidemia (max prevalence: 14.5%):
   Pathway 0: 267 patients (14.5%)
   Pathway 1: 168 patients (1.5%)
   Pathway 2: 343 patients (7.7%)
   Pathway 3: 217 patients (2.9%)

18. Atrial fibrillation and flutter (max prevalence: 17.6%):
   Pathway 0: 324 patients (17.6%)
   Pathway 1: 545 patients (4.9%)
   Pathway 2: 463 patients (10.4%)
   Pathway 3: 435 patients (5.9%)

19. Hypothyroidism NOS (max prevalence: 11.9%):
   Pathway 0: 91 patients (5.0%)
   Pathway 1: 268 patients (2.4%)
   Pathway 2: 529 patients (11.9%)
   Pathway 3: 157 patients (2.1%)

20. Major depressive disorder (max prevalence: 12.4%):
   Pathway 0: 110 patients (6.0%)
   Pathway 1: 251 patients (2.3%)
   Pathway 2: 549 patients (12.4%)
   Pathway 3: 271 patients (3.7%)
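The listing above keeps only diseases whose prevalence reaches 1.0% in at least one pathway, and each percentage is the patient count divided by the pathway size. A minimal sketch of that filter follows. The helper name and the "rare disease" row are hypothetical; the other counts and the pathway sizes are copied from the log.

```python
# Prevalence filter: keep a disease if it reaches the threshold in any pathway.
MIN_PREVALENCE = 0.01  # the ">= 1.0% in at least one pathway" rule from the log

pathway_sizes = [1836, 11108, 4439, 7420]  # pathways 0..3, from the log

counts = {
    "Coronary atherosclerosis": [1584, 927, 902, 1183],  # from the log
    "Type 2 diabetes": [481, 759, 825, 989],             # from the log
    "Hypothetical rare disease": [3, 20, 9, 11],         # made up for illustration
}

def prevalence_by_pathway(disease_counts, sizes):
    """Return the fraction of each pathway's patients with the disease."""
    return [count / size for count, size in zip(disease_counts, sizes)]

kept = {}
for disease, disease_counts in counts.items():
    prevalence = prevalence_by_pathway(disease_counts, pathway_sizes)
    if max(prevalence) >= MIN_PREVALENCE:
        kept[disease] = prevalence

for disease, prevalence in kept.items():
    formatted = ", ".join(f"{p * 100:.1f}%" for p in prevalence)
    print(f"{disease}: {formatted}")
```

Because the threshold is applied to the maximum across pathways, a disease concentrated in one small pathway is retained even when it is rare in the cohort as a whole.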

=== DISEASE CATEGORY ANALYSIS ===

CARDIOVASCULAR DISEASES:
  1. Coronary atherosclerosis (variance: 0.0974)
     Pathway 0: 86.3%
     Pathway 1: 8.3%
     Pathway 2: 20.3%
     Pathway 3: 15.9%
  2. Angina pectoris (variance: 0.0724)
     Pathway 0: 75.0%
     Pathway 1: 7.4%
     Pathway 2: 22.8%
     Pathway 3: 12.3%
  3. Other chronic ischemic heart disease, unspecified (variance: 0.0697)
     Pathway 0: 74.7%
     Pathway 1: 8.8%
     Pathway 2: 23.6%
     Pathway 3: 12.6%
  4. Essential hypertension (variance: 0.0541)
     Pathway 0: 74.8%
     Pathway 1: 20.8%
     Pathway 2: 65.2%
     Pathway 3: 27.8%
  5. Unstable angina (intermediate coronary syndrome) (variance: 0.0174)
     Pathway 0: 35.2%
     Pathway 1: 2.8%
     Pathway 2: 7.3%
     Pathway 3: 4.9%

METABOLIC DISEASES:
  1. Hypercholesterolemia (variance: 0.0661)
     Pathway 0: 75.9%
     Pathway 1: 9.9%
     Pathway 2: 29.7%
     Pathway 3: 16.8%
  2. Type 2 diabetes (variance: 0.0050)
     Pathway 0: 26.2%
     Pathway 1: 6.8%
     Pathway 2: 18.6%
     Pathway 3: 13.3%
  3. Obesity (variance: 0.0031)
     Pathway 0: 11.8%
     Pathway 1: 2.2%
     Pathway 2: 15.8%
     Pathway 3: 4.1%
  4. Type 1 diabetes (variance: 0.0001)
     Pathway 0: 3.4%
     Pathway 1: 1.2%
     Pathway 2: 2.7%
     Pathway 3: 3.0%
  5. Type 2 diabetes with ophthalmic manifestations (variance: 0.0000)
     Pathway 0: 1.6%
     Pathway 1: 0.5%
     Pathway 2: 0.9%
     Pathway 3: 2.4%

RHEUMATOLOGIC DISEASES:
  1. Arthropathy NOS (variance: 0.0127)
     Pathway 0: 18.0%
     Pathway 1: 8.0%
     Pathway 2: 35.4%
     Pathway 3: 7.5%
  2. Osteoarthritis; localized (variance: 0.0006)
     Pathway 0: 2.6%
     Pathway 1: 2.1%
     Pathway 2: 7.8%
     Pathway 3: 1.5%
  3. Rheumatoid arthritis (variance: 0.0004)
     Pathway 0: 2.0%
     Pathway 1: 1.2%
     Pathway 2: 5.8%
     Pathway 3: 1.1%
  4. Psoriasis vulgaris (variance: 0.0000)
     Pathway 0: 0.9%
     Pathway 1: 0.5%
     Pathway 2: 1.3%
     Pathway 3: 0.8%

NEOPLASTIC DISEASES:
  1. Benign neoplasm of colon (variance: 0.0043)
     Pathway 0: 9.5%
     Pathway 1: 4.4%
     Pathway 2: 20.2%
     Pathway 3: 4.1%
  2. Benign neoplasm of other parts of digestive system (variance: 0.0005)
     Pathway 0: 1.9%
     Pathway 1: 0.8%
     Pathway 2: 6.1%
     Pathway 3: 0.5%
  3. Other non-epithelial cancer of skin (variance: 0.0001)
     Pathway 0: 3.8%
     Pathway 1: 3.5%
     Pathway 2: 4.3%
     Pathway 3: 1.8%
  4. Colon cancer (variance: 0.0001)
     Pathway 0: 1.2%
     Pathway 1: 0.5%
     Pathway 2: 2.4%
     Pathway 3: 0.4%
  5. Malignant neoplasm of female breast (variance: 0.0000)
     Pathway 0: 0.6%
     Pathway 1: 1.7%
     Pathway 2: 2.0%
     Pathway 3: 0.5%

RESPIRATORY DISEASES:
  1. Asthma (variance: 0.0049)
     Pathway 0: 10.7%
     Pathway 1: 4.6%
     Pathway 2: 22.3%
     Pathway 3: 6.0%
  2. Chronic airway obstruction (variance: 0.0009)
     Pathway 0: 8.3%
     Pathway 1: 3.1%
     Pathway 2: 10.2%
     Pathway 3: 3.7%
  3. Other diseases of respiratory system, NEC (variance: 0.0004)
     Pathway 0: 6.8%
     Pathway 1: 2.1%
     Pathway 2: 6.8%
     Pathway 3: 4.4%
  4. Pneumococcal pneumonia (variance: 0.0002)
     Pathway 0: 3.4%
     Pathway 1: 1.7%
     Pathway 2: 5.2%
     Pathway 3: 2.9%
  5. Obstructive chronic bronchitis (variance: 0.0001)
     Pathway 0: 2.4%
     Pathway 1: 1.2%
     Pathway 2: 3.9%
     Pathway 3: 1.8%

GASTROINTESTINAL DISEASES:
  1. Diaphragmatic hernia (variance: 0.0072)
     Pathway 0: 13.1%
     Pathway 1: 5.1%
     Pathway 2: 26.0%
     Pathway 3: 5.4%
  2. Diverticulosis (variance: 0.0072)
     Pathway 0: 12.1%
     Pathway 1: 6.4%
     Pathway 2: 26.3%
     Pathway 3: 4.7%
  3. Ulcer of esophagus (variance: 0.0005)
     Pathway 0: 2.7%
     Pathway 1: 0.8%
     Pathway 2: 6.4%
     Pathway 3: 1.1%
  4. Hemorrhage of gastrointestinal tract (variance: 0.0003)
     Pathway 0: 2.7%
     Pathway 1: 1.0%
     Pathway 2: 5.4%
     Pathway 3: 1.1%
  5. Gastric ulcer (variance: 0.0002)
     Pathway 0: 2.8%
     Pathway 1: 0.9%
     Pathway 2: 4.5%
     Pathway 3: 1.1%

NEUROLOGICAL DISEASES:
  1. Migraine (variance: 0.0001)
     Pathway 0: 1.1%
     Pathway 1: 0.4%
     Pathway 2: 2.7%
     Pathway 3: 0.3%
  2. Epilepsy, recurrent seizures, convulsions (variance: 0.0000)
     Pathway 0: 1.7%
     Pathway 1: 1.0%
     Pathway 2: 2.1%
     Pathway 3: 1.1%

INFECTIOUS DISEASES:
  1. Urinary tract infection (variance: 0.0004)
     Pathway 0: 4.8%
     Pathway 1: 3.0%
     Pathway 2: 8.4%
     Pathway 3: 5.1%
  2. Pneumococcal pneumonia (variance: 0.0002)
     Pathway 0: 3.4%
     Pathway 1: 1.7%
     Pathway 2: 5.2%
     Pathway 3: 2.9%
  3. Bacterial infection NOS (variance: 0.0001)
     Pathway 0: 3.8%
     Pathway 1: 1.6%
     Pathway 2: 4.8%
     Pathway 3: 3.2%
  4. Sepsis (variance: 0.0001)
     Pathway 0: 1.1%
     Pathway 1: 0.6%
     Pathway 2: 2.3%
     Pathway 3: 2.3%
  5. Pneumonia (variance: 0.0000)
     Pathway 0: 2.1%
     Pathway 1: 1.3%
     Pathway 2: 3.1%
     Pathway 3: 2.3%

RENAL DISEASES:
  1. Acute renal failure (variance: 0.0002)
     Pathway 0: 3.9%
     Pathway 1: 1.5%
     Pathway 2: 5.7%
     Pathway 3: 4.2%
  2. Chronic Kidney Disease, Stage III (variance: 0.0001)
     Pathway 0: 1.7%
     Pathway 1: 1.2%
     Pathway 2: 3.7%
     Pathway 3: 1.2%
  3. Chronic renal failure [CKD] (variance: 0.0001)
     Pathway 0: 3.4%
     Pathway 1: 1.4%
     Pathway 2: 3.4%
     Pathway 3: 2.7%
  4. Hypertensive chronic kidney disease (variance: 0.0000)
     Pathway 0: 2.7%
     Pathway 1: 0.8%
     Pathway 2: 1.6%
     Pathway 3: 1.4%
  5. Cyst of kidney, acquired (variance: 0.0000)
     Pathway 0: 0.9%
     Pathway 1: 0.6%
     Pathway 2: 1.6%
     Pathway 3: 0.6%

ENDOCRINE DISEASES:
  1. Hypothyroidism NOS (variance: 0.0016)
     Pathway 0: 5.0%
     Pathway 1: 2.4%
     Pathway 2: 11.9%
     Pathway 3: 2.1%

OPHTHALMIC DISEASES:
  1. Cataract (variance: 0.0002)
     Pathway 0: 7.7%
     Pathway 1: 5.4%
     Pathway 2: 8.4%
     Pathway 3: 5.1%
  2. Senile cataract (variance: 0.0001)
     Pathway 0: 5.6%
     Pathway 1: 3.5%
     Pathway 2: 5.9%
     Pathway 3: 3.2%
  3. Diabetic retinopathy (variance: 0.0001)
     Pathway 0: 2.0%
     Pathway 1: 0.7%
     Pathway 2: 1.4%
     Pathway 3: 2.9%
  4. Ptosis of eyelid (variance: 0.0000)
     Pathway 0: 0.3%
     Pathway 1: 0.4%
     Pathway 2: 1.2%
     Pathway 3: 0.5%
  5. Glaucoma (variance: 0.0000)
     Pathway 0: 1.7%
     Pathway 1: 1.5%
     Pathway 2: 2.0%
     Pathway 3: 1.2%

DERMATOLOGIC DISEASES:
  1. Other non-epithelial cancer of skin (variance: 0.0001)
     Pathway 0: 3.8%
     Pathway 1: 3.5%
     Pathway 2: 4.3%
     Pathway 3: 1.8%
  2. Chronic ulcer of skin (variance: 0.0000)
     Pathway 0: 0.9%
     Pathway 1: 0.4%
     Pathway 2: 1.4%
     Pathway 3: 1.7%
  3. Lipoma of skin and subcutaneous tissue (variance: 0.0000)
     Pathway 0: 1.7%
     Pathway 1: 1.0%
     Pathway 2: 2.1%
     Pathway 3: 1.0%
  4. Benign neoplasm of skin (variance: 0.0000)
     Pathway 0: 1.5%
     Pathway 1: 1.7%
     Pathway 2: 2.4%
     Pathway 3: 1.2%
  5. Atopic/contact dermatitis due to other or unspecified (variance: 0.0000)
     Pathway 0: 0.7%
     Pathway 1: 0.3%
     Pathway 2: 1.4%
     Pathway 3: 0.6%
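Within each category, diseases are ordered by a "variance" value. The logged numbers are consistent with the population variance of the four per-pathway prevalences, expressed as proportions. That reading is inferred from the printed values, not confirmed by the source. A small sketch under that assumption:

```python
from statistics import pvariance

# Per-pathway prevalences (pathways 0..3) as proportions, copied from the log.
prevalence = {
    "Coronary atherosclerosis": [0.863, 0.083, 0.203, 0.159],
    "Essential hypertension": [0.748, 0.208, 0.652, 0.278],
    "Type 2 diabetes": [0.262, 0.068, 0.186, 0.133],
    "Glaucoma": [0.017, 0.015, 0.020, 0.012],
}

# Assumption: "variance" means population variance across the four pathways.
variance = {disease: pvariance(values) for disease, values in prevalence.items()}

# Rank diseases by how strongly their prevalence differs between pathways.
ranked = sorted(variance, key=variance.get, reverse=True)

for disease in ranked:
    print(f"{disease}: variance = {variance[disease]:.4f}")
```

The recomputed values differ from the log in the fourth decimal place because the inputs here are already rounded. Note that this ranking favors common diseases: a rare disease with a large relative difference between pathways still has a small absolute variance.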

8. SAVING RESULTS
   Saved complete results to: complete_pathway_analysis_output/ukb_pathway_discovery/complete_analysis_results.pkl
   Saved summary to: complete_pathway_analysis_output/ukb_pathway_discovery/analysis_summary.txt

✅ Complete pathway analysis for myocardial infarction finished!
   All results saved to: complete_pathway_analysis_output/ukb_pathway_discovery/
   Complete log saved to: complete_pathway_analysis_output/ukb_pathway_discovery/complete_analysis_log.txt

✅ UKB pathway discovery complete

STEP 2: Transition Analysis (UKB vs MGB)

Analyze transitions from the precursor disease (rheumatoid arthritis) to the target disease (myocardial infarction) in both cohorts:

In [4]:
# ============================================================================
# STEP 2: TRANSITION ANALYSIS (UKB vs MGB)
# ============================================================================
# Analyze transitions from precursor disease to target disease across cohorts

print("\n" + "="*80)
print("STEP 2: TRANSITION ANALYSIS")
print("="*80)

results['transition'] = run_transition_analysis_both_cohorts(
    transition_disease_name=transition_disease,
    target_disease_name=target_disease,
    years_before=lookback_years,
    age_tolerance=5,
    min_followup=5,
    mgb_model_path=mgb_model_path,
    output_dir=os.path.join(output_dir, 'transition_analysis')
)

print("\n✅ Transition analysis complete")
================================================================================
STEP 2: TRANSITION ANALYSIS
================================================================================
================================================================================
TRANSITION ANALYSIS: UKB vs MGB
================================================================================
Precursor: Rheumatoid arthritis
Target: myocardial infarction
Years before: 10
================================================================================

================================================================================
STEP 1: LOADING UKB DATA
================================================================================
Loading full dataset...
Loaded Y (full): torch.Size([407878, 348, 52])
Loaded thetas: (400000, 21, 52)
Loaded 400000 processed IDs
Subset Y to first 400K patients: torch.Size([400000, 348, 52])
Loaded 348 diseases
Total patients with complete data: 400000
✅ UKB data loaded:
   Y shape: torch.Size([400000, 348, 52])
   Thetas shape: (400000, 21, 52)
   Diseases: 348

🔍 Checking if diseases exist in UKB...

Searching for: 'Rheumatoid arthritis'
Found 2 potential matches:
  [297] Rheumatoid arthritis                                         (score: 100.0, type: exact_substring)
  [325] Osteoarthritis; localized                                    (score: 30.0, type: single_term)

Searching for: 'myocardial infarction'
Found 2 potential matches:
  [112] Myocardial infarction                                        (score: 100.0, type: exact_substring)
  [135] Cerebral artery occlusion, with cerebral infarction          (score: 70.0, type: partial_terms)

✅ Using UKB disease names:
   Transition: 'Rheumatoid arthritis'
   Target: 'Myocardial infarction'

================================================================================
STEP 2: RUNNING UKB TRANSITION ANALYSIS
================================================================================

================================================================================
RA PROGRESSION ANALYSIS (MATCHED ON AGE AT RA DIAGNOSIS)
  Precursor: Rheumatoid arthritis
  Target: Myocardial infarction
  Plotting: 10 years leading up to MI
  Age tolerance: ±5 years
================================================================================

Found target disease: Myocardial infarction (index 112)
Found transition disease: Rheumatoid arthritis (index 297)
Population reference shape: (21, 52)
Found 509 RA patients who develop MI
Found 7796 RA patients who DON'T develop MI

=== AGE MATCHING AT RA DIAGNOSIS ===
Found 509 age-matched pairs
Average time from RA to MI: 7 years

Analyzing 10 years
Progressors (RA→MI): 509 patients
Non-progressors (RA only): 507 patients

Plot saved as 'bc_progression_Rheumatoid_arthritis_matched_on_bc_age.png'
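The matching step above pairs each progressor with a non-progressor whose age at precursor diagnosis lies within the ±5-year tolerance. The source does not show how pairs are chosen, so the following is only one plausible scheme: greedy 1:1 nearest-age matching without replacement. The function name and the example ages are hypothetical.

```python
def match_on_age(progressor_ages, control_ages, tolerance=5):
    """Greedy 1:1 matching without replacement.

    Returns (progressor_index, control_index) pairs. Progressors with no
    control inside the tolerance are left unmatched.
    """
    available = dict(enumerate(control_ages))  # controls not yet used
    pairs = []
    for i, age in enumerate(progressor_ages):
        candidates = [
            (abs(age - control_age), j)
            for j, control_age in available.items()
            if abs(age - control_age) <= tolerance
        ]
        if not candidates:
            continue
        _, best = min(candidates)  # closest age; ties go to the lower index
        pairs.append((i, best))
        del available[best]        # each control is used at most once
    return pairs

# Hypothetical ages at precursor diagnosis.
progressors = [52, 60, 71]
controls = [50, 58, 59, 80]

pairs = match_on_age(progressors, controls)
print(pairs)
```

In this toy example the third progressor stays unmatched because no control is within five years. Greedy matching depends on the order of the progressors, so an optimal-assignment method could give different pairs.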

================================================================================
VERIFICATION: Signature 5 Deviations
================================================================================

Signature 5 - RA patients who develop MI:
  Values (years -10 to -1): [0.00614689 0.00846632 0.01288823 0.01997185 0.03019964 0.04367169
 0.06005038 0.07815162 0.0955545  0.11044265]
  At -2 years: 0.0956

Signature 5 - RA patients who DON'T develop MI:
  Values (years -10 to -1): [-0.01400137 -0.01572944 -0.01779694 -0.01983037 -0.02150985 -0.02255172
 -0.02284858 -0.02249413 -0.02166112 -0.02047591]
  At -2 years: -0.0217

Comparison:
  RA→MI has SIGNATURE 5 at -2 years: 0.0956
  RA only has SIGNATURE 5 at -2 years: -0.0217
  Difference (MI minus no-MI): 0.1172
  ✓ RA patients who develop MI have HIGHER signature 5
================================================================================
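The verification block lists ten deviation values for years -10 to -1, so the "-2 years" figure is the ninth entry of each array, not the last. The sketch below redoes that lookup and the progressor minus non-progressor difference using the UKB values printed above. The variable names are illustrative.

```python
# Signature 5 deviations from the population reference, years -10 .. -1 (UKB log).
ukb_progressors = [
    0.00614689, 0.00846632, 0.01288823, 0.01997185, 0.03019964,
    0.04367169, 0.06005038, 0.07815162, 0.0955545, 0.11044265,
]
ukb_non_progressors = [
    -0.01400137, -0.01572944, -0.01779694, -0.01983037, -0.02150985,
    -0.02255172, -0.02284858, -0.02249413, -0.02166112, -0.02047591,
]

years = list(range(-10, 0))   # [-10, -9, ..., -1]
idx = years.index(-2)         # position of the "-2 years" value

difference = ukb_progressors[idx] - ukb_non_progressors[idx]

print(f"Progressors at -2 years:     {ukb_progressors[idx]:.4f}")
print(f"Non-progressors at -2 years: {ukb_non_progressors[idx]:.4f}")
print(f"Difference (MI minus no-MI): {difference:.4f}")
```

The gap between the two groups widens steadily over the decade: progressors rise from about 0.006 to 0.110, while non-progressors stay slightly below the reference throughout.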
Line plot saved as 'bc_progression_line_plot_Rheumatoid_arthritis.png'
================================================================================
STEP 3: LOADING MGB DATA
================================================================================
================================================================================
LOADING MGB MODEL DATA
================================================================================

Loading MGB model from: /Users/sarahurbut/Dropbox-Personal/model_with_kappa_bigam_MGB.pt
Model keys: ['model_state_dict', 'clusters', 'psi', 'Y', 'prevalence_t', 'logit_prevalence_t', 'G', 'E', 'disease_names', 'hyperparameters']

MGB Data shapes:
  Y: (34592, 346, 51)
  Lambda: (34592, 21, 51)
  Thetas: (34592, 21, 51)
  Disease names: 346 diseases
✅ MGB data loaded:
   Y shape: torch.Size([34592, 346, 51])
   Thetas shape: (34592, 21, 51)
   Diseases: 346

🔍 Checking if diseases exist in MGB...

Searching for: 'Rheumatoid arthritis'
Found 2 potential matches:
  [295] Rheumatoid arthritis                                         (score: 100.0, type: exact_substring)
  [323] Osteoarthritis; localized                                    (score: 30.0, type: single_term)

Searching for: 'myocardial infarction'
Found 2 potential matches:
  [112] Myocardial infarction                                        (score: 100.0, type: exact_substring)
  [133] Cerebral artery occlusion, with cerebral infarction          (score: 70.0, type: partial_terms)

✅ Using MGB disease names:
   Transition: 'Rheumatoid arthritis'
   Target: 'Myocardial infarction'

================================================================================
STEP 4: RUNNING MGB TRANSITION ANALYSIS
================================================================================

================================================================================
RA PROGRESSION ANALYSIS (MATCHED ON AGE AT RA DIAGNOSIS)
  Precursor: Rheumatoid arthritis
  Target: Myocardial infarction
  Plotting: 10 years leading up to MI
  Age tolerance: ±5 years
================================================================================

Found target disease: Myocardial infarction (index 112)
Found transition disease: Rheumatoid arthritis (index 295)
Population reference shape: (21, 51)
Found 99 RA patients who develop MI
Found 1347 RA patients who DON'T develop MI

=== AGE MATCHING AT RA DIAGNOSIS ===
Found 99 age-matched pairs
Average time from RA to MI: 3 years

Analyzing 10 years
Progressors (RA→MI): 97 patients
Non-progressors (RA only): 97 patients

Plot saved as 'bc_progression_Rheumatoid_arthritis_matched_on_bc_age.png'

================================================================================
VERIFICATION: Signature 5 Deviations
================================================================================

Signature 5 - RA patients who develop MI:
  Values (years -10 to -1): [-0.00708867 -0.00622161 -0.00363407  0.00100465  0.00783178  0.01666135
  0.027064    0.03868554  0.05087995  0.06351656]
  At -2 years: 0.0509

Signature 5 - RA patients who DON'T develop MI:
  Values (years -10 to -1): [-0.00811616 -0.01056505 -0.01302919 -0.01531091 -0.01721572 -0.01860621
 -0.01907601 -0.01868    -0.01778725 -0.01650405]
  At -2 years: -0.0178

Comparison:
  RA→MI has SIGNATURE 5 at -2 years: 0.0509
  RA only has SIGNATURE 5 at -2 years: -0.0178
  Difference (MI minus no-MI): 0.0687
  ✓ RA patients who develop MI have HIGHER signature 5
================================================================================
Line plot saved as 'bc_progression_line_plot_Rheumatoid_arthritis.png'
================================================================================
STEP 5: COMPARING UKB vs MGB RESULTS
================================================================================

📊 SUMMARY STATISTICS:
--------------------------------------------------------------------------------

Sample Sizes:
  UKB: 509 progressors, 507 non-progressors
       509 matched pairs
  MGB: 97 progressors, 97 non-progressors
       99 matched pairs

Signature Trajectories:
  UKB shape: (21, 10)
  MGB shape: (21, 10)

  Comparing 21 signatures...
  Average signature trajectory correlation: 0.489

  Top 5 most similar signatures (by trajectory correlation):
    Signature 5: correlation = 0.999
    Signature 17: correlation = 0.982
    Signature 18: correlation = 0.975
    Signature 3: correlation = 0.954
    Signature 11: correlation = 0.950
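The similarity ranking above is a per-signature correlation between the UKB and MGB ten-year trajectories. The sketch below computes a Pearson correlation for the two Signature 5 progressor trajectories printed earlier in this output. Whether the notebook correlates exactly these two arrays, rather than some other trajectory such as the progressor minus non-progressor difference, is an assumption.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Signature 5 deviations in progressors, years -10 .. -1, copied from the log.
ukb_sig5 = [
    0.00614689, 0.00846632, 0.01288823, 0.01997185, 0.03019964,
    0.04367169, 0.06005038, 0.07815162, 0.0955545, 0.11044265,
]
mgb_sig5 = [
    -0.00708867, -0.00622161, -0.00363407, 0.00100465, 0.00783178,
    0.01666135, 0.027064, 0.03868554, 0.05087995, 0.06351656,
]

r = pearson(ukb_sig5, mgb_sig5)
print(f"Signature 5 trajectory correlation: {r:.3f}")
```

Correlation measures shape only. The two trajectories rise in the same way even though the MGB values are roughly half the UKB values, so a correlation near 1 does not mean the cohorts agree on absolute level.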

  📊 SIGNATURE 3 DETAILED COMPARISON:
  --------------------------------------------------------------------------------

  UKB:
    Progressors (RA → MI):     mean = -0.0146
    Non-progressors (RA only): mean = -0.0165
    Difference (NP - Prog):     -0.0019
    Pattern: Prog > NP

  MGB:
    Progressors (RA → MI):     mean = -0.0081
    Non-progressors (RA only): mean = -0.0025
    Difference (NP - Prog):     +0.0056
    Pattern: NP > Prog

  Pattern Consistency: ❌ DIFFERENT

  Absolute Levels (Progressors):
    UKB: -0.0146
    MGB: -0.0081
    Ratio (MGB/UKB): 0.55

  Absolute Levels (Non-progressors):
    UKB: -0.0165
    MGB: -0.0025
    Ratio (MGB/UKB): 0.15
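The "Pattern Consistency" verdict above only asks whether progressors sit above or below non-progressors in each cohort. A sketch of that check, with the cohort ratios recomputed from the logged means, follows. The function and dictionary names are illustrative, not the notebook's.

```python
def direction(progressor_mean, non_progressor_mean):
    """Label which group has the higher mean signature deviation."""
    return "Prog > NP" if progressor_mean > non_progressor_mean else "NP > Prog"

# Signature 3 means copied from the log.
ukb = {"progressors": -0.0146, "non_progressors": -0.0165}
mgb = {"progressors": -0.0081, "non_progressors": -0.0025}

ukb_pattern = direction(ukb["progressors"], ukb["non_progressors"])
mgb_pattern = direction(mgb["progressors"], mgb["non_progressors"])
consistent = ukb_pattern == mgb_pattern

ratio_progressors = mgb["progressors"] / ukb["progressors"]
ratio_non_progressors = mgb["non_progressors"] / ukb["non_progressors"]

print(f"UKB: {ukb_pattern}, MGB: {mgb_pattern}, consistent: {consistent}")
print(f"Ratio (MGB/UKB), progressors:     {ratio_progressors:.2f}")
print(f"Ratio (MGB/UKB), non-progressors: {ratio_non_progressors:.2f}")
```

The group differences being compared here are small (about 0.002 in UKB and 0.006 in MGB), far below the 0.07 to 0.12 gaps reported for Signature 5. A sign-only comparison carries no information about whether such differences exceed sampling noise.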

================================================================================
STEP 6: CREATING SIDE-BY-SIDE COMPARISON
================================================================================
✅ Saved comparison plot: complete_pathway_analysis_output/transition_analysis/ukb_mgb_comparison_rheumatoid_arthritis_to_myocardial_infarction.png

================================================================================
✅ TRANSITION ANALYSIS COMPLETE!
================================================================================

Results saved to: complete_pathway_analysis_output/transition_analysis/

✅ Transition analysis complete

STEP 3: Signature 5 Analysis with FH Carriers

Analyze Signature 5 patterns within each pathway, together with familial hypercholesterolemia (FH) carrier prevalence:

In [5]:
# ============================================================================
# STEP 3: SIGNATURE 5 ANALYSIS
# ============================================================================
# Analyze Signature 5 patterns by pathway with FH (familial hypercholesterolemia) carriers

print("\n" + "="*80)
print("STEP 3: SIGNATURE 5 ANALYSIS")
print("="*80)

results['sig5'] = analyze_signature5_by_pathway(
    target_disease=target_disease,
    output_dir=os.path.join(output_dir, 'ukb_pathway_discovery'),
    fh_carrier_path='/Users/sarahurbut/Downloads/out/ukb_exome_450k_fh.carrier.txt'
)

print("\n✅ Signature 5 analysis complete")
================================================================================
STEP 3: SIGNATURE 5 ANALYSIS
================================================================================
================================================================================
ANALYZING SIGNATURE 5 IN EACH MI PATHWAY
================================================================================
Loading full dataset...
Loaded Y (full): torch.Size([407878, 348, 52])
Loaded thetas: (400000, 21, 52)
Loaded 400000 processed IDs
Subset Y to first 400K patients: torch.Size([400000, 348, 52])
Loaded 348 diseases
Total patients with complete data: 400000

Analyzing 4 pathways...

Loading FH carrier data from: /Users/sarahurbut/Downloads/out/ukb_exome_450k_fh.carrier.txt
  ✅ Loaded 2,564 FH carriers
Found precursor disease indices:
  coronary atherosclerosis: index 114
  hypercholesterolemia: index 52
  angina: index 111
  hypertension: index 109
  diabetes: index 46
  obesity: index 60

============================================================
PATHWAY 0
============================================================
Number of patients: 1836

Precursor Disease Prevalence (BEFORE MI):
  coronary atherosclerosis: 1584 (86.3%)
  hypercholesterolemia: 1393 (75.9%)
  angina: 647 (35.2%)
  hypertension: 1374 (74.8%)
  diabetes: 63 (3.4%)
  obesity: 216 (11.8%)

Reference population prevalence up to average MI age (≈ 68.9y):
  coronary atherosclerosis: ref 5.4%, pathway 86.3% (Δ +80.9 pp)
  hypercholesterolemia: ref 9.5%, pathway 75.9% (Δ +66.4 pp)
  angina: ref 1.4%, pathway 35.2% (Δ +33.8 pp)
  hypertension: ref 22.5%, pathway 74.8% (Δ +52.3 pp)
  diabetes: ref 0.8%, pathway 3.4% (Δ +2.6 pp)
  obesity: ref 5.1%, pathway 11.8% (Δ +6.6 pp)

Signature 5 Deviations (ages 60-70):
  Average deviation: +0.1800
  Max deviation: 0.1969

FH Carrier Prevalence:
  Carriers: 19/1836 (1.03%)
  Non-carriers: 1817/1836 (98.97%)
  Overall population carrier rate: 0.54%
  Enrichment ratio: 1.92x

Summary:
  Average precursor prevalence: 47.9%
  Signature 5 deviation: +0.1800

  ✓ High precursor prevalence (47.9%)
     and elevated signature 5 (+0.1800)
     INTERPRETATION: Classic atherosclerosis pathway

============================================================
PATHWAY 1
============================================================
Number of patients: 11108

Precursor Disease Prevalence (BEFORE MI):
  coronary atherosclerosis: 927 (8.3%)
  hypercholesterolemia: 1105 (9.9%)
  angina: 309 (2.8%)
  hypertension: 2308 (20.8%)
  diabetes: 129 (1.2%)
  obesity: 243 (2.2%)

Reference population prevalence up to average MI age (≈ 65.4y):
  coronary atherosclerosis: ref 4.3%, pathway 8.3% (Δ +4.1 pp)
  hypercholesterolemia: ref 7.2%, pathway 9.9% (Δ +2.8 pp)
  angina: ref 1.2%, pathway 2.8% (Δ +1.6 pp)
  hypertension: ref 17.5%, pathway 20.8% (Δ +3.2 pp)
  diabetes: ref 0.7%, pathway 1.2% (Δ +0.5 pp)
  obesity: ref 4.1%, pathway 2.2% (Δ -1.9 pp)

Signature 5 Deviations (ages 60-70):
  Average deviation: +0.0555
  Max deviation: 0.0574

FH Carrier Prevalence:
  Carriers: 84/11108 (0.76%)
  Non-carriers: 11024/11108 (99.24%)
  Overall population carrier rate: 0.54%
  Enrichment ratio: 1.40x

Summary:
  Average precursor prevalence: 7.5%
  Signature 5 deviation: +0.0555

  ⚠️  KEY FINDING: Low precursor prevalence (7.5%)
     BUT signature 5 still elevated (+0.0555)
     INTERPRETATION: Subclinical cardiovascular risk detected!

============================================================
PATHWAY 2
============================================================
Number of patients: 4439

Precursor Disease Prevalence (BEFORE MI):
  coronary atherosclerosis: 902 (20.3%)
  hypercholesterolemia: 1320 (29.7%)
  angina: 323 (7.3%)
  hypertension: 2896 (65.2%)
  diabetes: 118 (2.7%)
  obesity: 703 (15.8%)

Reference population prevalence up to average MI age (≈ 70.1y):
  coronary atherosclerosis: ref 6.1%, pathway 20.3% (Δ +14.2 pp)
  hypercholesterolemia: ref 11.2%, pathway 29.7% (Δ +18.5 pp)
  angina: ref 1.6%, pathway 7.3% (Δ +5.7 pp)
  hypertension: ref 25.9%, pathway 65.2% (Δ +39.3 pp)
  diabetes: ref 0.9%, pathway 2.7% (Δ +1.8 pp)
  obesity: ref 5.9%, pathway 15.8% (Δ +9.9 pp)

Signature 5 Deviations (ages 60-70):
  Average deviation: +0.0443
  Max deviation: 0.0522

FH Carrier Prevalence:
  Carriers: 28/4439 (0.63%)
  Non-carriers: 4411/4439 (99.37%)
  Overall population carrier rate: 0.54%
  Enrichment ratio: 1.17x

Summary:
  Average precursor prevalence: 23.5%
  Signature 5 deviation: +0.0443

  ✓ High precursor prevalence (23.5%)
     and elevated signature 5 (+0.0443)
     INTERPRETATION: Classic atherosclerosis pathway

============================================================
PATHWAY 3
============================================================
Number of patients: 7420

Precursor Disease Prevalence (BEFORE MI):
  coronary atherosclerosis: 1183 (15.9%)
  hypercholesterolemia: 1245 (16.8%)
  angina: 360 (4.9%)
  hypertension: 2066 (27.8%)
  diabetes: 222 (3.0%)
  obesity: 303 (4.1%)

Reference population prevalence up to average MI age (≈ 62.2y):
  coronary atherosclerosis: ref 3.1%, pathway 15.9% (Δ +12.8 pp)
  hypercholesterolemia: ref 5.0%, pathway 16.8% (Δ +11.8 pp)
  angina: ref 1.0%, pathway 4.9% (Δ +3.9 pp)
  hypertension: ref 12.7%, pathway 27.8% (Δ +15.2 pp)
  diabetes: ref 0.6%, pathway 3.0% (Δ +2.4 pp)
  obesity: ref 3.1%, pathway 4.1% (Δ +1.0 pp)

Signature 5 Deviations (ages 60-70):
  Average deviation: +0.1257
  Max deviation: 0.1424

FH Carrier Prevalence:
  Carriers: 74/7420 (1.00%)
  Non-carriers: 7346/7420 (99.00%)
  Overall population carrier rate: 0.54%
  Enrichment ratio: 1.85x

Summary:
  Average precursor prevalence: 12.1%
  Signature 5 deviation: +0.1257

  ⚠️  KEY FINDING: Low precursor prevalence (12.1%)
     BUT signature 5 still elevated (+0.1257)
     INTERPRETATION: Subclinical cardiovascular risk detected!
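Each pathway summary above averages the six precursor prevalences and then assigns an interpretation from that average and the Signature 5 deviation. The cut-off behind "high" versus "low" precursor prevalence is not printed. The 20% threshold below is a hypothetical value chosen only because it separates the logged cases (23.5% labeled high, 12.1% labeled low).

```python
PRECURSOR_THRESHOLD = 0.20  # hypothetical cut-off, not taken from the source

def summarize_pathway(precursor_prevalences, sig5_deviation):
    """Return (average precursor prevalence, interpretation label)."""
    average = sum(precursor_prevalences) / len(precursor_prevalences)
    if sig5_deviation <= 0:
        label = "No signature 5 elevation"
    elif average >= PRECURSOR_THRESHOLD:
        label = "Classic atherosclerosis pathway"
    else:
        label = "Subclinical cardiovascular risk"
    return average, label

# Precursor prevalences from the log, in the order printed:
# coronary atherosclerosis, hypercholesterolemia, angina, hypertension,
# diabetes, obesity.
pathway0 = [0.863, 0.759, 0.352, 0.748, 0.034, 0.118]
pathway1 = [0.083, 0.099, 0.028, 0.208, 0.012, 0.022]

avg0, label0 = summarize_pathway(pathway0, sig5_deviation=0.1800)
avg1, label1 = summarize_pathway(pathway1, sig5_deviation=0.0555)

print(f"Pathway 0: {avg0 * 100:.1f}% -> {label0}")
print(f"Pathway 1: {avg1 * 100:.1f}% -> {label1}")
```

An unweighted mean over six precursors treats a 3% diabetes prevalence the same as an 86% coronary atherosclerosis prevalence, so the label is sensitive to which precursors are included in the list.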

================================================================================
FH CARRIER PREVALENCE SUMMARY ACROSS PATHWAYS
================================================================================
Pathway       N        Carriers/Non-carriers   Carrier %    Enrichment   95% CI (carrier %)
----------------------------------------------------------------------------------------------------
Pathway 0     1836     19/1817                 1.03%        1.92x        [0.66, 1.61]
Pathway 1     11108    84/11024                0.76%        1.40x        [0.61, 0.94]
Pathway 2     4439     28/4411                 0.63%        1.17x        [0.44, 0.91]
Pathway 3     7420     74/7346                 1.00%        1.85x        [0.80, 1.25]
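In the summary table, the enrichment column is the pathway carrier rate divided by the overall population carrier rate, and the bracketed intervals cover the carrier percentage, not the enrichment ratio. The printed bounds are consistent with a Wilson score interval on the carrier proportion. That is an inference from the numbers, since the source does not name the method. A sketch for Pathway 0:

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width

POPULATION_CARRIER_RATE = 0.0054  # 0.54% as printed in the log (rounded)

carriers, n_patients = 19, 1836   # Pathway 0, from the log
carrier_rate = carriers / n_patients
enrichment = carrier_rate / POPULATION_CARRIER_RATE
ci_low, ci_high = wilson_interval(carriers, n_patients)

print(f"Carrier rate: {carrier_rate * 100:.2f}%")
print(f"Enrichment:   {enrichment:.2f}x")
print(f"95% CI:       [{ci_low * 100:.2f}, {ci_high * 100:.2f}] (percent)")
```

For all four pathways the lower bound sits above the 0.54% population rate, which supports enrichment of FH carriers among MI cases overall. The intervals for different pathways overlap substantially.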

================================================================================
STATISTICAL COMPARISONS BETWEEN PATHWAYS
================================================================================

Pathway 0 vs Pathway 1:
  Carrier rates: 1.03% vs 0.76%
  Odds Ratio: 1.372
  Fisher's exact p-value: 2.0344e-01
  Not significant (p >= 0.05)

Pathway 0 vs Pathway 2:
  Carrier rates: 1.03% vs 0.63%
  Odds Ratio: 1.647
  Fisher's exact p-value: 1.0676e-01
  Not significant (p >= 0.05)

Pathway 0 vs Pathway 3:
  Carrier rates: 1.03% vs 1.00%
  Odds Ratio: 1.038
  Fisher's exact p-value: 8.9600e-01
  Not significant (p >= 0.05)

Pathway 1 vs Pathway 2:
  Carrier rates: 0.76% vs 0.63%
  Odds Ratio: 1.200
  Fisher's exact p-value: 4.6263e-01
  Not significant (p >= 0.05)

Pathway 1 vs Pathway 3:
  Carrier rates: 0.76% vs 1.00%
  Odds Ratio: 0.756
  Fisher's exact p-value: 8.6834e-02
  Not significant (p >= 0.05)

Pathway 2 vs Pathway 3:
  Carrier rates: 0.63% vs 1.00%
  Odds Ratio: 0.630
  Fisher's exact p-value: 3.9607e-02
  ✓ Significant difference (p < 0.05)
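Each pairwise comparison above is a 2×2 table of carriers and non-carriers in two pathways. The sketch below computes the sample odds ratio and a two-sided Fisher's exact p-value using only the standard library. It is an illustrative reimplementation, not the notebook's code, which likely calls a statistics library. The two-sided rule used here, summing every table that is no more probable than the observed one, is a common convention and is assumed.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def fisher_exact_two_sided(a, b, c, d):
    """Fisher's exact test for the table [[a, b], [c, d]].

    Returns (sample odds ratio, two-sided p-value).
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def log_prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(total, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    observed = log_prob(a)
    p_value = sum(
        math.exp(log_prob(x))
        for x in range(lowest, highest + 1)
        if log_prob(x) <= observed + 1e-7  # tables no more likely than observed
    )
    odds_ratio = (a * d) / (b * c)
    return odds_ratio, min(p_value, 1.0)

# Pathway 0 vs Pathway 1: carriers / non-carriers, from the log.
odds_ratio, p_value = fisher_exact_two_sided(19, 1817, 84, 11024)
print(f"Odds ratio: {odds_ratio:.3f}")
print(f"Two-sided p-value: {p_value:.4f}")
```

Six pairwise tests are reported with no visible multiple-comparison adjustment. Under a Bonferroni correction the per-test threshold would be 0.05 / 6 ≈ 0.0083, which the single flagged comparison (p ≈ 0.040) would not meet.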

Saved signature 5 analysis plot: complete_pathway_analysis_output/ukb_pathway_discovery/signature5_analysis_myocardial_infarction.pdf
Saved FH carrier prevalence plot: complete_pathway_analysis_output/ukb_pathway_discovery/fh_carrier_prevalence_by_pathway_myocardial_infarction.pdf
Saved comprehensive pathway comparison plot: complete_pathway_analysis_output/ukb_pathway_discovery/comprehensive_pathway_comparison_myocardial_infarction.pdf

✅ Signature 5 analysis complete

STEP 4: Cross-Cohort Reproducibility

Validate pathway reproducibility across UKB and MGB cohorts:

In [6]:
# ============================================================================
# STEP 4: REPRODUCIBILITY VALIDATION
# ============================================================================
# Compare pathways discovered in UKB vs MGB cohorts

print("\n" + "="*80)
print("STEP 4: REPRODUCIBILITY VALIDATION")
print("="*80)

results['reproducibility'] = show_reproducibility(force_rerun_mgb=False)

print("\n✅ Reproducibility validation complete")

print("\n" + "="*80)
print("✅ ALL ANALYSES COMPLETE!")
print("="*80)
================================================================================
STEP 4: REPRODUCIBILITY VALIDATION
================================================================================
================================================================================
PATHWAY REPRODUCIBILITY ANALYSIS
================================================================================

Step 1: Getting pathway matches...
================================================================================
PATHWAY MATCHING: UKB ↔ MGB
================================================================================

1. Loading UKB results from: complete_pathway_analysis_output/ukb_pathway_discovery/complete_analysis_results.pkl
   ✅ UKB results loaded

2. Running MGB analysis (force_rerun=False)...
================================================================================
MGB DEVIATION-BASED PATHWAY ANALYSIS: MYOCARDIAL INFARCTION
================================================================================
================================================================================
LOADING MGB MODEL DATA
================================================================================

Loading MGB model from: /Users/sarahurbut/Dropbox-Personal/model_with_kappa_bigam_MGB.pt
Model keys: ['model_state_dict', 'clusters', 'psi', 'Y', 'prevalence_t', 'logit_prevalence_t', 'G', 'E', 'disease_names', 'hyperparameters']

MGB Data shapes:
  Y: (34592, 346, 51)
  Lambda: (34592, 21, 51)
  Thetas: (34592, 21, 51)
  Disease names: 346 diseases

MGB Dataset:
  Patients: 34,592
  Diseases: 346
  Signatures: 21
  Time points: 51

1. DISCOVERING PATHWAYS USING DEVIATION-FROM-REFERENCE METHOD
   (10-year lookback, 4 pathways)
=== DISCOVERING PATHWAYS TO MYOCARDIAL INFARCTION ===
Method: deviation_from_reference
Lookback years: 10
Found target disease: Myocardial infarction (index 112)
Found 2761 patients who developed myocardial infarction

Creating trajectory features for pathway discovery...
Method: deviation_from_reference

--- COMPUTING POPULATION REFERENCE FOR DEVIATION-BASED CLUSTERING ---
Computing population-level signature reference from all 34592 patients...
Population reference shape: (21, 51)
Created 210 features per patient (DEVIATION from reference)
  - 210 features: deviation per signature per timepoint (K signatures × 10 timepoints)
Kept 2692 patients with sufficient pre-disease history

Discovered 4 pathways to myocardial infarction:
  Pathway 0: 221 patients (8.0%)
  Pathway 1: 537 patients (19.4%)
  Pathway 2: 778 patients (28.2%)
  Pathway 3: 1156 patients (41.9%)
✅ Discovered 4 pathways in MGB

2. INTERROGATING MGB PATHWAYS
=== INTERROGATING PATHWAYS TO MYOCARDIAL INFARCTION ===

1. PATHWAY STATISTICS:
   Pathway 0: 221 patients (8.2%)
   Pathway 1: 537 patients (19.9%)
   Pathway 2: 778 patients (28.9%)
   Pathway 3: 1156 patients (42.9%)

2. CALCULATING SIGNATURE TRAJECTORIES:
   Pathway 0: 221 patients
   Pathway 1: 537 patients
   Pathway 2: 778 patients
   Pathway 3: 1156 patients

3. MOST DISCRIMINATING SIGNATURES:
   Top 5 discriminating signatures:
     1. Signature 20: Score = 0.7177
     2. Signature 11: Score = 0.6367
     3. Signature 6: Score = 0.5322
     4. Signature 13: Score = 0.5131
     5. Signature 12: Score = 0.4747

4. DISEASE PATTERNS BY PATHWAY (PRE-TARGET DISEASE):
   Pathway 0 top PRE-disease conditions:
     1. Pain in joint: 184 patients (83.3%)
     2. Essential hypertension: 158 patients (71.5%)
     3. Pain in limb: 145 patients (65.6%)
     4. Back pain: 141 patients (63.8%)
     5. Hyperlipidemia: 135 patients (61.1%)
     6. Spondylosis without myelopathy: 131 patients (59.3%)
     7. Osteoarthrosis, localized, primary: 125 patients (56.6%)
     8. Arthropathy NOS: 120 patients (54.3%)
     9. Spinal stenosis: 109 patients (49.3%)
     10. Enthesopathy: 107 patients (48.4%)
   Pathway 1 top PRE-disease conditions:
     1. Essential hypertension: 413 patients (76.9%)
     2. Hyperlipidemia: 338 patients (62.9%)
     3. Type 2 diabetes: 270 patients (50.3%)
     4. Coronary atherosclerosis: 261 patients (48.6%)
     5. Arrhythmia (cardiac) NOS: 249 patients (46.4%)
     6. Pain in joint: 229 patients (42.6%)
     7. Obesity: 218 patients (40.6%)
     8. Hypercholesterolemia: 208 patients (38.7%)
     9. Major depressive disorder: 204 patients (38.0%)
     10. GERD: 204 patients (38.0%)
   Pathway 2 top PRE-disease conditions:
     1. Essential hypertension: 251 patients (32.3%)
     2. Hyperlipidemia: 175 patients (22.5%)
     3. Coronary atherosclerosis: 152 patients (19.5%)
     4. Pain in joint: 123 patients (15.8%)
     5. Arrhythmia (cardiac) NOS: 115 patients (14.8%)
     6. GERD: 106 patients (13.6%)
     7. Hypercholesterolemia: 93 patients (12.0%)
     8. Pain in limb: 90 patients (11.6%)
     9. Back pain: 77 patients (9.9%)
     10. Benign neoplasm of skin: 75 patients (9.6%)
   Pathway 3 top PRE-disease conditions:
     1. Essential hypertension: 574 patients (49.7%)
     2. Hyperlipidemia: 445 patients (38.5%)
     3. Coronary atherosclerosis: 421 patients (36.4%)
     4. Arrhythmia (cardiac) NOS: 310 patients (26.8%)
     5. Hypercholesterolemia: 261 patients (22.6%)
     6. Pain in joint: 257 patients (22.2%)
     7. GERD: 229 patients (19.8%)
     8. Type 2 diabetes: 223 patients (19.3%)
     9. Pain in limb: 184 patients (15.9%)
     10. Obesity: 181 patients (15.7%)

4b. DISEASES THAT DIFFERENTIATE PATHWAYS:
   Top 15 diseases that differentiate pathways (by variance in prevalence):
     1. Pain in joint
        Pathway 0: 184 patients (83.3%)
        Pathway 1: 229 patients (42.6%)
        Pathway 2: 123 patients (15.8%)
        Pathway 3: 257 patients (22.2%)
     2. Pain in limb
        Pathway 0: 145 patients (65.6%)
        Pathway 1: 196 patients (36.5%)
        Pathway 2: 90 patients (11.6%)
        Pathway 3: 184 patients (15.9%)
     3. Spondylosis without myelopathy
        Pathway 0: 131 patients (59.3%)
        Pathway 1: 108 patients (20.1%)
        Pathway 2: 49 patients (6.3%)
        Pathway 3: 97 patients (8.4%)
     4. Back pain
        Pathway 0: 141 patients (63.8%)
        Pathway 1: 165 patients (30.7%)
        Pathway 2: 77 patients (9.9%)
        Pathway 3: 171 patients (14.8%)
     5. Osteoarthrosis, localized, primary
        Pathway 0: 125 patients (56.6%)
        Pathway 1: 103 patients (19.2%)
        Pathway 2: 56 patients (7.2%)
        Pathway 3: 113 patients (9.8%)
     6. Arthropathy NOS
        Pathway 0: 120 patients (54.3%)
        Pathway 1: 118 patients (22.0%)
        Pathway 2: 65 patients (8.4%)
        Pathway 3: 118 patients (10.2%)
     7. Spinal stenosis
        Pathway 0: 109 patients (49.3%)
        Pathway 1: 82 patients (15.3%)
        Pathway 2: 43 patients (5.5%)
        Pathway 3: 71 patients (6.1%)
     8. Essential hypertension
        Pathway 0: 158 patients (71.5%)
        Pathway 1: 413 patients (76.9%)
        Pathway 2: 251 patients (32.3%)
        Pathway 3: 574 patients (49.7%)
     9. Enthesopathy
        Pathway 0: 107 patients (48.4%)
        Pathway 1: 70 patients (13.0%)
        Pathway 2: 39 patients (5.0%)
        Pathway 3: 66 patients (5.7%)
     10. Osteoarthrosis NOS
        Pathway 0: 107 patients (48.4%)
        Pathway 1: 91 patients (16.9%)
        Pathway 2: 31 patients (4.0%)
        Pathway 3: 93 patients (8.0%)
     11. Hyperlipidemia
        Pathway 0: 135 patients (61.1%)
        Pathway 1: 338 patients (62.9%)
        Pathway 2: 175 patients (22.5%)
        Pathway 3: 445 patients (38.5%)
     12. Neuralgia, neuritis, and radiculitis NOS
        Pathway 0: 98 patients (44.3%)
        Pathway 1: 82 patients (15.3%)
        Pathway 2: 33 patients (4.2%)
        Pathway 3: 71 patients (6.1%)
     13. Type 2 diabetes
        Pathway 0: 62 patients (28.1%)
        Pathway 1: 270 patients (50.3%)
        Pathway 2: 58 patients (7.5%)
        Pathway 3: 223 patients (19.3%)
     14. Obesity
        Pathway 0: 89 patients (40.3%)
        Pathway 1: 218 patients (40.6%)
        Pathway 2: 56 patients (7.2%)
        Pathway 3: 181 patients (15.7%)
     15. Degeneration of intervertebral disc
        Pathway 0: 87 patients (39.4%)
        Pathway 1: 61 patients (11.4%)
        Pathway 2: 25 patients (3.2%)
        Pathway 3: 56 patients (4.8%)

5. CREATING PATHWAY VISUALIZATIONS:
   Saved plot: mgb_deviation_analysis_output/top_discriminating_signatures.pdf
   Saved plot: mgb_deviation_analysis_output/pathway_size_and_age.pdf
6. CREATING STACKED SIGNATURE DEVIATION PLOTS:
   Saved plot: mgb_deviation_analysis_output/signature_deviations_by_pathway.pdf
Summary of signature deviations (5 years before disease):
  Pathway 0: Total absolute deviation = 0.379
    Top 3 signatures: [(2, 0.093084544), (9, 0.07810839), (6, -0.047022022)]
  Pathway 1: Total absolute deviation = 0.350
    Top 3 signatures: [(6, 0.076894544), (5, 0.055115283), (15, -0.03755516)]
  Pathway 2: Total absolute deviation = 0.244
    Top 3 signatures: [(5, 0.061932586), (1, -0.052006155), (2, -0.03394649)]
  Pathway 3: Total absolute deviation = 0.238
    Top 3 signatures: [(5, 0.07462637), (6, 0.039245643), (2, -0.033039533)]

3. RUNNING STATISTICAL TESTS ON MGB PATHWAYS
================================================================================
COMPREHENSIVE STATISTICAL TESTS FOR PATHWAY GROUPS
================================================================================

1. Testing disease prevalence differences...
   Found 249 significantly different diseases (FDR < 0.05)

2. Testing signature trajectory differences...
   Found 21 signatures with significant differences (p < 0.05)

3. Testing age at disease onset differences...
   ANOVA: F=16.405, p=0.0000

4. Permutation test for pathway stability...
   Observed variance: 0.0001
   Permuted mean: 0.0000
   p-value: 0.0000

7. Calculating effect sizes...
   Effect sizes calculated for signatures, diseases, and age
   Saved summary: mgb_deviation_analysis_output/statistical_tests_summary.txt

✅ Results saved to mgb_deviation_analysis_output/
✅ Statistical tests complete

✅ Results saved to: mgb_deviation_analysis_output/mgb_deviation_analysis_results.pkl
   ✅ MGB analysis complete

3. Loading data for pathway matching...
Loading full dataset...
Loaded Y (full): torch.Size([407878, 348, 52])
Loaded thetas: (400000, 21, 52)
Loaded 400000 processed IDs
Subset Y to first 400K patients: torch.Size([400000, 348, 52])
Loaded 348 diseases
Total patients with complete data: 400000

4. Matching pathways by disease patterns...
================================================================================
MATCHING PATHWAYS BY DISEASE PATTERNS
================================================================================

Extracting disease enrichment patterns for each pathway...

UKB pathways: [0, 1, 2, 3]
MGB pathways: [0, 1, 2, 3]

Calculating pathway similarities...

Similarity Matrix (all pathway pairs):
UKB\MGB   MGB 0       MGB 1       MGB 2       MGB 3       
----------------------------------------------------------
UKB 0     0.649       0.572       0.498       0.528       
UKB 1     0.736       0.537       0.816       0.744       
UKB 2     0.610       0.851       0.559       0.629       
UKB 3     0.527       0.606       0.493       0.498       

Finding best pathway matches (using optimal assignment)...
  UKB Pathway 0 ↔ MGB Pathway 0 (similarity: 0.649)
  UKB Pathway 1 ↔ MGB Pathway 2 (similarity: 0.816)
  UKB Pathway 2 ↔ MGB Pathway 1 (similarity: 0.851)
  UKB Pathway 3 ↔ MGB Pathway 3 (similarity: 0.498)

Disease pattern matches:
--------------------------------------------------------------------------------

UKB Pathway 0 ↔ MGB Pathway 0 (similarity: 0.649)
  Matched 7 diseases. Top 5 matching diseases:
    UKB: Strabismus (not specified as paralytic) (enrichment: 1.88x)
    MGB: Thoracic or lumbosacral neuritis or radiculitis, unspecified (enrichment: 3.41x)
    UKB: Other diseases of the teeth and supporting structures (enrichment: 1.81x)
    MGB: Other disorders of synovium, tendon, and bursa (enrichment: 2.48x)
    UKB: Peripheral vascular disease, unspecified (enrichment: 1.69x)
    MGB: Peripheral enthesopathies and allied syndromes (enrichment: 2.46x)
    UKB: Diseases and other conditions of the tongue (enrichment: 1.65x)
    MGB: Other acquired deformities of limbs (enrichment: 2.72x)
    UKB: Renal failure NOS (enrichment: 1.61x)
    MGB: Synovitis and tenosynovitis (enrichment: 2.68x)

UKB Pathway 1 ↔ MGB Pathway 2 (similarity: 0.816)
  Matched 22 diseases. Top 5 matching diseases:
    UKB: Malignant neoplasm of ovary (enrichment: 1.35x)
    MGB: Malignant neoplasm of ovary (enrichment: 1.86x)
    UKB: Malignant neoplasm of uterus (enrichment: 1.32x)
    MGB: Malignant neoplasm of ovary (enrichment: 1.86x)
    UKB: Secondary malignant neoplasm of digestive systems (enrichment: 1.31x)
    MGB: Secondary malignant neoplasm of digestive systems (enrichment: 1.80x)
    UKB: Breast cancer [female] (enrichment: 1.30x)
    MGB: Breast cancer [female] (enrichment: 1.42x)
    UKB: Secondary malignancy of bone (enrichment: 1.25x)
    MGB: Secondary malignancy of bone (enrichment: 1.47x)

UKB Pathway 2 ↔ MGB Pathway 1 (similarity: 0.851)
  Matched 14 diseases. Top 5 matching diseases:
    UKB: Irregular menstrual bleeding (enrichment: 3.60x)
    MGB: Irregular menstrual cycle (enrichment: 2.30x)
    UKB: Irregular menstrual cycle (enrichment: 2.87x)
    MGB: Irregular menstrual cycle (enrichment: 2.30x)
    UKB: Pain and other symptoms associated with female genital organs (enrichment: 2.83x)
    MGB: Pain and other symptoms associated with female genital organs (enrichment: 1.79x)
    UKB: Hypertrophy of female genital organs (enrichment: 2.58x)
    MGB: Pain and other symptoms associated with female genital organs (enrichment: 1.79x)
    UKB: Other derangement of joint (enrichment: 2.51x)
    MGB: Other disorders of testis (enrichment: 2.01x)

UKB Pathway 3 ↔ MGB Pathway 3 (similarity: 0.498)
  Matched 12 diseases. Top 5 matching diseases:
    UKB: Chronic ulcer of skin (enrichment: 1.46x)
    MGB: Ulcer of esophagus (enrichment: 1.16x)
    UKB: Other local infections of skin and subcutaneous tissue (enrichment: 1.46x)
    MGB: Other non-epithelial cancer of skin (enrichment: 1.13x)
    UKB: Other acute and subacute forms of ischemic heart disease (enrichment: 1.45x)
    MGB: Other hypertrophic and atrophic conditions of skin (enrichment: 1.13x)
    UKB: Other disorders of testis (enrichment: 1.38x)
    MGB: Cholelithiasis with other cholecystitis (enrichment: 1.16x)
    UKB: Streptococcus infection (enrichment: 1.35x)
    MGB: Peritoneal adhesions (postoperative) (postinfection) (enrichment: 1.21x)

================================================================================
PATHWAY MATCHES
================================================================================

UKB Pathway     MGB Pathway     Similarity      Diseases Matched    
--------------------------------------------------------------------------------
Pathway 0            Pathway 0            0.649           7                   
  Top matching diseases:
    • Strabismus (not specified as paralytic) (UKB: 1.88x) ↔ Thoracic or lumbosacral neuritis or radiculitis, unspecified (MGB: 3.41x)
    • Other diseases of the teeth and supporting structures (UKB: 1.81x) ↔ Other disorders of synovium, tendon, and bursa (MGB: 2.48x)
    • Peripheral vascular disease, unspecified (UKB: 1.69x) ↔ Peripheral enthesopathies and allied syndromes (MGB: 2.46x)
    • Diseases and other conditions of the tongue (UKB: 1.65x) ↔ Other acquired deformities of limbs (MGB: 2.72x)
    • Renal failure NOS (UKB: 1.61x) ↔ Synovitis and tenosynovitis (MGB: 2.68x)
Pathway 1            Pathway 2            0.816           22                  
  Top matching diseases:
    • Malignant neoplasm of ovary (UKB: 1.35x) ↔ Malignant neoplasm of ovary (MGB: 1.86x)
    • Malignant neoplasm of uterus (UKB: 1.32x) ↔ Malignant neoplasm of ovary (MGB: 1.86x)
    • Secondary malignant neoplasm of digestive systems (UKB: 1.31x) ↔ Secondary malignant neoplasm of digestive systems (MGB: 1.80x)
    • Breast cancer [female] (UKB: 1.30x) ↔ Breast cancer [female] (MGB: 1.42x)
    • Secondary malignancy of bone (UKB: 1.25x) ↔ Secondary malignancy of bone (MGB: 1.47x)
Pathway 2            Pathway 1            0.851           14                  
  Top matching diseases:
    • Irregular menstrual bleeding (UKB: 3.60x) ↔ Irregular menstrual cycle (MGB: 2.30x)
    • Irregular menstrual cycle (UKB: 2.87x) ↔ Irregular menstrual cycle (MGB: 2.30x)
    • Pain and other symptoms associated with female genital organs (UKB: 2.83x) ↔ Pain and other symptoms associated with female genital organs (MGB: 1.79x)
    • Hypertrophy of female genital organs (UKB: 2.58x) ↔ Pain and other symptoms associated with female genital organs (MGB: 1.79x)
    • Other derangement of joint (UKB: 2.51x) ↔ Other disorders of testis (MGB: 2.01x)
Pathway 3            Pathway 3            0.498           12                  
  Top matching diseases:
    • Chronic ulcer of skin (UKB: 1.46x) ↔ Ulcer of esophagus (MGB: 1.16x)
    • Other local infections of skin and subcutaneous tissue (UKB: 1.46x) ↔ Other non-epithelial cancer of skin (MGB: 1.13x)
    • Other acute and subacute forms of ischemic heart disease (UKB: 1.45x) ↔ Other hypertrophic and atrophic conditions of skin (MGB: 1.13x)
    • Other disorders of testis (UKB: 1.38x) ↔ Cholelithiasis with other cholecystitis (MGB: 1.16x)
    • Streptococcus infection (UKB: 1.35x) ↔ Peritoneal adhesions (postoperative) (postinfection) (MGB: 1.21x)

================================================================================
SUMMARY
================================================================================

✅ Found 4 pathway matches
   Average similarity: 0.704
   High similarity matches (>0.5): 3/4

Step 2: Creating reproducibility visualizations...

================================================================================
CREATING REPRODUCIBILITY FIGURES
================================================================================
   ✅ Figure saved to: pathway_reproducibility_ukb_mgb.png
Step 3: Comparing signature patterns...

================================================================================
COMPARING SIGNATURE PATTERNS FOR MATCHED PATHWAYS
================================================================================

⚠️  IMPORTANT: Signature indices are arbitrary across cohorts.
   UKB Sig 5 may not correspond to MGB Sig 5 biologically.
   We compare overall patterns and biological content, not index alignment.
   Pathways are matched by disease enrichment patterns, not signature indices.

1. Computing signature deviation trajectories...
Loading full dataset...
Loaded Y (full): torch.Size([407878, 348, 52])
Loaded thetas: (400000, 21, 52)
Loaded 400000 processed IDs
Subset Y to first 400K patients: torch.Size([400000, 348, 52])
Loaded 348 diseases
Total patients with complete data: 400000
   UKB: 21 signatures, 52 timepoints
   MGB: 21 signatures, 51 timepoints

2. Calculating deviations for matched pathways...
   UKB Pathway 0 ↔ MGB Pathway 0: 1836 vs 221 patients
   UKB Pathway 1 ↔ MGB Pathway 2: 11108 vs 778 patients
   UKB Pathway 2 ↔ MGB Pathway 1: 4439 vs 537 patients
   UKB Pathway 3 ↔ MGB Pathway 3: 7420 vs 1156 patients

3. Computing signature correspondence...

   Using predefined signature correspondence from cross-tabulation...
   Using 21 signature correspondences from cross-tabulation
      MGB Sig 0 ↔ UKB Sig 4
      MGB Sig 1 ↔ UKB Sig 7
      MGB Sig 2 ↔ UKB Sig 1
      MGB Sig 3 ↔ UKB Sig 12
      MGB Sig 4 ↔ UKB Sig 16
      MGB Sig 5 ↔ UKB Sig 5
      MGB Sig 6 ↔ UKB Sig 15
      MGB Sig 7 ↔ UKB Sig 2
      MGB Sig 8 ↔ UKB Sig 17
      MGB Sig 9 ↔ UKB Sig 9
      MGB Sig 10 ↔ UKB Sig 11
      MGB Sig 11 ↔ UKB Sig 6
      MGB Sig 12 ↔ UKB Sig 3
      MGB Sig 13 ↔ UKB Sig 18
      MGB Sig 14 ↔ UKB Sig 14
      MGB Sig 15 ↔ UKB Sig 19
      MGB Sig 16 ↔ UKB Sig 10
      MGB Sig 18 ↔ UKB Sig 13
      MGB Sig 19 ↔ UKB Sig 8
      MGB Sig 20 ↔ UKB Sig 20

4. Creating signature deviation trajectory plots...
   ✅ Figure saved to: signature_deviation_trajectories_all_sigs_ukb_mgb.png

   ✅ Matching signatures are plotted with the same color.
      Matched pairs: 21 signatures
      Unmatched signatures shown in grayscale (dashed lines)
      Central legend shows all UKB-MGB signature connections
5. Creating signature deviation heatmaps...
   ✅ Heatmap saved to: signature_deviation_heatmaps_all_sigs_ukb_mgb.png
Step 4: Comparing PRS patterns (genetic validation)...

================================================================================
COMPARING PRS PATTERNS: STRONG REPRODUCIBILITY VALIDATION
================================================================================

1. Loading MGB PRS from model...
   ✅ MGB PRS shape: (34592, 36)

2. Loading UKB PRS...
   ⚠️  Could not load UKB PRS: [Errno 2] No such file or directory: '/Users/sarahurbut/aladynoulli2/pyScripts/prs_with_eid.csv'
   Skipping PRS comparison

================================================================================
✅ REPRODUCIBILITY ANALYSIS COMPLETE!
================================================================================

REPRODUCIBILITY SUMMARY:
--------------------------------------------------------------------------------
✅ 4 pathways matched between cohorts
✅ Disease pattern similarity: 0.704
✅ Proportion correlation: 0.702
✅ Age difference: 3.1 years

CONCLUSION: Deviation-based pathway discovery generalizes across cohorts!

✅ Reproducibility validation complete

================================================================================
✅ ALL ANALYSES COMPLETE!
================================================================================
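
The matching step above reports an "optimal assignment" over the UKB × MGB similarity matrix. As a minimal sketch of what such an assignment does, the block below tries every one-to-one pairing of the printed 4×4 matrix and keeps the one with the highest total similarity. How `show_reproducibility` actually computes the assignment is not shown in this notebook, so the exhaustive search and the helper name `best_assignment` are assumptions made for illustration; with more pathways, a Hungarian-algorithm solver would be the usual choice.

```python
from itertools import permutations

# Similarity matrix copied from the output above (rows: UKB 0-3, columns: MGB 0-3).
similarity = [
    [0.649, 0.572, 0.498, 0.528],
    [0.736, 0.537, 0.816, 0.744],
    [0.610, 0.851, 0.559, 0.629],
    [0.527, 0.606, 0.493, 0.498],
]


def best_assignment(sim):
    """Return the one-to-one row-to-column pairing with the highest total similarity.

    Hypothetical helper: exhaustive search, practical only for small matrices.
    """
    n = len(sim)
    return max(
        permutations(range(n)),
        key=lambda cols: sum(sim[row][col] for row, col in enumerate(cols)),
    )


assignment = best_assignment(similarity)
matched = [similarity[row][col] for row, col in enumerate(assignment)]
mean_similarity = sum(matched) / len(matched)
n_high = sum(s > 0.5 for s in matched)

for ukb, mgb in enumerate(assignment):
    print(f"UKB Pathway {ukb} <-> MGB Pathway {mgb} (similarity: {similarity[ukb][mgb]:.3f})")
print(f"Mean similarity: {mean_similarity:.4f}; matches above 0.5: {n_high}/{len(matched)}")
```

On this matrix the search recovers the pairing reported above (0↔0, 1↔2, 2↔1, 3↔3). The UKB Pathway 3 ↔ MGB Pathway 3 pair is matched largely by elimination: its similarity (0.498) falls below the 0.5 threshold, and MGB Pathway 3 is more similar to UKB Pathway 1 (0.744) than to UKB Pathway 3.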
In [15]:
# ============================================================================
# DISPLAY ALL PATHWAY VISUALIZATIONS
# ============================================================================

from IPython.display import Image, display, Markdown, HTML
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from pathlib import Path
import pandas as pd
import os

# Try to import PDF conversion libraries
try:
    from pdf2image import convert_from_path
    HAS_PDF2IMAGE = True
except ImportError:
    HAS_PDF2IMAGE = False

try:
    import fitz  # PyMuPDF
    HAS_PYMUPDF = True
except ImportError:
    HAS_PYMUPDF = False

output_path = Path(output_dir) / 'ukb_pathway_discovery'

# List of all visualizations that should be generated
visualizations = {
    'Top Discriminating Signatures': 'top_discriminating_signatures.pdf',
    'Pathway Size and Age': 'pathway_size_and_age.pdf',
    'Signature Deviations by Pathway': 'signature_deviations_by_pathway.pdf',
    'Signature Deviations (Line Plot)': 'signature_deviations_myocardial_infarction_10yr_line.pdf',
    'Signature Deviations (Stacked)': 'signature_deviations_myocardial_infarction_10yr_stacked.pdf',
    'PRS by Pathway': 'prs_by_pathway.pdf',
    'Signature 5 Analysis': 'signature5_analysis_myocardial_infarction.pdf',
    'FH Carrier Prevalence': 'fh_carrier_prevalence_by_pathway_myocardial_infarction.pdf',
    'Comprehensive Pathway Comparison': 'comprehensive_pathway_comparison_myocardial_infarction.pdf'
}

print("="*80)
print("PATHWAY ANALYSIS VISUALIZATIONS")
print("="*80)

def display_pdf_as_image(pdf_path, title):
    """Render the first page of a PDF inline; fall back to a file link.

    Returns True if an image was displayed, False if only the link was shown.
    """
    pdf_path = Path(pdf_path)
    
    # Try PyMuPDF first (faster)
    if HAS_PYMUPDF:
        try:
            doc = fitz.open(str(pdf_path))
            try:
                page = doc[0]  # First page
                pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))  # 2x zoom for better quality
                img_data = pix.tobytes("png")
            finally:
                doc.close()  # Release the file even if rendering fails
            display(Image(img_data))
            return True
        except Exception as e:
            # Report the failure instead of silently swallowing it
            print(f"   ⚠️  PyMuPDF could not render {pdf_path.name}: {e}")
    
    # Try pdf2image
    if HAS_PDF2IMAGE:
        try:
            images = convert_from_path(pdf_path, dpi=150, first_page=1, last_page=1)
            if images:
                # Convert PIL image to PNG bytes
                from io import BytesIO
                buf = BytesIO()
                images[0].save(buf, format='PNG')
                display(Image(buf.getvalue()))
                return True
        except Exception as e:
            print(f"   ⚠️  pdf2image could not render {pdf_path.name}: {e}")
    
    # Fallback: show link (as_uri() percent-encodes spaces and special characters)
    display(HTML(f'''
        <div style="margin: 10px 0; padding: 10px; border: 1px solid #ccc; background: #f5f5f5;">
            <p><strong>{title}</strong></p>
            <p>PDF file: <code>{pdf_path.name}</code></p>
            <p><a href="{pdf_path.resolve().as_uri()}" target="_blank" style="color: #0066cc;">📄 Click here to open PDF in new tab</a></p>
        </div>
    '''))
    return False

# Display each visualization
for title, filename in visualizations.items():
    fig_path = output_path / filename
    
    if fig_path.exists():
        print(f"\n📊 {title}")
        print(f"   File: {filename}")
        
        # Display title
        display(Markdown(f"### {title}"))
        
        # Try to display as image
        display_pdf_as_image(fig_path, title)
        
    else:
        print(f"\n⚠️  {title}: {filename} (not found)")

print("\n" + "="*80)
print("✅ Visualization display complete")
if not HAS_PDF2IMAGE and not HAS_PYMUPDF:
    print("💡 Tip: Install 'PyMuPDF' (pip install pymupdf) or 'pdf2image' for better PDF display")
print("="*80)
================================================================================
PATHWAY ANALYSIS VISUALIZATIONS
================================================================================

📊 Top Discriminating Signatures
   File: top_discriminating_signatures.pdf

Top Discriminating Signatures¶

📊 Pathway Size and Age
   File: pathway_size_and_age.pdf

Pathway Size and Age¶

📊 Signature Deviations by Pathway
   File: signature_deviations_by_pathway.pdf

Signature Deviations by Pathway¶

📊 Signature Deviations (Line Plot)
   File: signature_deviations_myocardial_infarction_10yr_line.pdf

Signature Deviations (Line Plot)¶

📊 Signature Deviations (Stacked)
   File: signature_deviations_myocardial_infarction_10yr_stacked.pdf

Signature Deviations (Stacked)¶

📊 PRS by Pathway
   File: prs_by_pathway.pdf

PRS by Pathway¶

📊 Signature 5 Analysis
   File: signature5_analysis_myocardial_infarction.pdf

Signature 5 Analysis¶

📊 FH Carrier Prevalence
   File: fh_carrier_prevalence_by_pathway_myocardial_infarction.pdf

FH Carrier Prevalence¶

📊 Comprehensive Pathway Comparison
   File: comprehensive_pathway_comparison_myocardial_infarction.pdf

Comprehensive Pathway Comparison¶

================================================================================
✅ Visualization display complete
================================================================================
In [8]:
# ============================================================================
# DISPLAY SUMMARY STATISTICS AND DATA TABLES
# ============================================================================

output_path = Path(output_dir) / 'ukb_pathway_discovery'

# Check for summary file
summary_file = output_path / 'analysis_summary.txt'
if summary_file.exists():
    print("="*80)
    print("ANALYSIS SUMMARY")
    print("="*80)
    with open(summary_file, 'r') as f:
        print(f.read())

# Check for disease prevalence
prevalence_file = output_path / 'disease_prevalence_tests.csv'
if prevalence_file.exists():
    print("\n" + "="*80)
    print("DISEASE PREVALENCE BY PATHWAY")
    print("="*80)
    prevalence_df = pd.read_csv(prevalence_file)
    print(f"\nLoaded {len(prevalence_df)} disease-pathway comparisons")
    display(prevalence_df.head(20))

# Load signature discrimination table
sig_disc_file = output_path / 'signature_discrimination_table.csv'
if sig_disc_file.exists():
    print("\n" + "="*80)
    print("SIGNATURE DISCRIMINATION")
    print("="*80)
    sig_disc_df = pd.read_csv(sig_disc_file)
    display(sig_disc_df)

# Load top diseases table
top_diseases_file = output_path / 'top_20_diseases_table.csv'
if top_diseases_file.exists():
    print("\n" + "="*80)
    print("TOP DISEASES BY PATHWAY")
    print("="*80)
    top_diseases_df = pd.read_csv(top_diseases_file)
    display(top_diseases_df)

# Load effect sizes if available (as PNG)
effect_sizes_file = output_path / 'effect_sizes_heatmap.png'
if effect_sizes_file.exists():
    print("\n" + "="*80)
    print("EFFECT SIZES HEATMAP")
    print("="*80)
    try:
        img = mpimg.imread(effect_sizes_file)
        plt.figure(figsize=(14, 10))
        plt.imshow(img)
        plt.axis('off')
        plt.title('Effect Sizes by Pathway', fontsize=16, fontweight='bold')
        plt.tight_layout()
        plt.show()
    except Exception as e:
        print(f"   ⚠️  Could not display: {e}")
================================================================================
ANALYSIS SUMMARY
================================================================================
================================================================================
PATHWAY ANALYSIS SUMMARY: MYOCARDIAL INFARCTION
Method: Deviation-from-Reference (10-year lookback)
================================================================================

PATHWAY SIZES:
--------------------------------------------------------------------------------
Pathway 0:  1836 patients (  7.4%)
Pathway 1: 11108 patients ( 44.8%)
Pathway 2:  4439 patients ( 17.9%)
Pathway 3:  7420 patients ( 29.9%)
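
The reproducibility summary earlier in this notebook reports a "proportion correlation" of 0.702 between cohorts. One plausible reading, sketched below, is a Pearson correlation between the pathway proportions of matched UKB and MGB pathways. This is an assumption about what `show_reproducibility` computes, not a confirmed description of it. The inputs are the rounded percentages printed in this notebook (UKB from the pathway sizes above, MGB from the "PATHWAY STATISTICS" output), paired using the reported matches; `pearson` is a hypothetical helper.

```python
from math import sqrt

# Rounded pathway proportions (%) as printed in the notebook output.
ukb_pct = {0: 7.4, 1: 44.8, 2: 17.9, 3: 29.9}
mgb_pct = {0: 8.2, 1: 19.9, 2: 28.9, 3: 42.9}

# UKB -> MGB pathway matches reported by the reproducibility step.
matches = {0: 0, 1: 2, 2: 1, 3: 3}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


ukb_values = [ukb_pct[u] for u in matches]
mgb_values = [mgb_pct[m] for m in matches.values()]

r = pearson(ukb_values, mgb_values)
print(f"Pearson r across matched pathways: {r:.3f}")
```

With these rounded inputs the result is close to the reported value. With only four matched pairs, such a correlation is a coarse descriptive summary rather than a statistical test.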


In [9]:
# ============================================================================
# DISPLAY ADDITIONAL VISUALIZATIONS
# ============================================================================

def show_png(fig_path):
    """Display a PNG inline, titled with its file name (underscores shown as spaces)."""
    print(f"\n📊 {fig_path.name}")
    try:
        img = mpimg.imread(fig_path)
        plt.figure(figsize=(14, 10))
        plt.imshow(img)
        plt.axis('off')
        plt.title(fig_path.stem.replace('_', ' '), fontsize=14, fontweight='bold')
        plt.tight_layout()
        plt.show()
    except Exception as e:
        print(f"   ⚠️  Could not display: {e}")

# Check for transition analysis visualizations
transition_path = Path(output_dir) / 'transition_analysis'
if transition_path.exists():
    print("="*80)
    print("TRANSITION ANALYSIS VISUALIZATIONS")
    print("="*80)
    
    transition_figs = [
        'ukb_mgb_comparison_rheumatoid_arthritis_to_myocardial_infarction.png'
    ]
    
    for fig_name in transition_figs:
        fig_path = transition_path / fig_name
        if fig_path.exists():
            show_png(fig_path)

# Check for reproducibility visualizations (in current directory)
reproducibility_figs = [
    'pathway_reproducibility_ukb_mgb.png',
    'signature_deviation_trajectories_all_sigs_ukb_mgb.png',
    'signature_deviation_heatmaps_all_sigs_ukb_mgb.png'
]

print("\n" + "="*80)
print("REPRODUCIBILITY VISUALIZATIONS")
print("="*80)

for fig_name in reproducibility_figs:
    fig_path = Path(fig_name)
    if fig_path.exists():
        show_png(fig_path)
    else:
        print(f"\n⚠️  {fig_name} (not found - may need to run reproducibility analysis)")
================================================================================
TRANSITION ANALYSIS VISUALIZATIONS
================================================================================

📊 ukb_mgb_comparison_rheumatoid_arthritis_to_myocardial_infarction.png
================================================================================
REPRODUCIBILITY VISUALIZATIONS
================================================================================

📊 pathway_reproducibility_ukb_mgb.png
📊 signature_deviation_trajectories_all_sigs_ukb_mgb.png
📊 signature_deviation_heatmaps_all_sigs_ukb_mgb.png

Summary: Heterogeneity Demonstrated¶

This pathway analysis demonstrates three types of heterogeneity:

  1. Patient Heterogeneity: Different MI patients have different signature profiles

    • Multiple distinct pathways are identified
    • Each pathway represents a subset of MI patients
  2. Biological Heterogeneity: The same phenotype (MI) arises from different mechanisms

    • Pathways show distinct signature deviation patterns
    • Pathways differ in the biological processes their signatures represent
  3. Disease Heterogeneity: "MI" is not a single entity

    • Pathways differ in their disease associations
    • Pathways differ in their genetic risk profiles (if PRS data are available)
    • Pathways differ in their signature activation patterns

Key Finding: Our model captures this heterogeneity through individual-specific signature loadings, allowing us to identify distinct biological pathways that lead to the same clinical outcome.
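
One way to make "distinct signature patterns" concrete is to average signature deviations within each pathway and rank signatures by how much those averages differ between pathways. The sketch below does this on synthetic data. The function names, array shapes, and the variance-based ranking are illustrative assumptions; this is not necessarily how top_discriminating_signatures.pdf is produced.

```python
import numpy as np

def pathway_signature_profiles(deviations, labels):
    """Mean deviation per pathway and signature.

    deviations: (n_patients, n_signatures) time-averaged deviations (hypothetical input)
    labels:     (n_patients,) pathway assignment for each patient
    Returns (pathways, profiles), with profiles of shape (n_pathways, n_signatures).
    """
    pathways = np.unique(labels)
    profiles = np.stack([deviations[labels == p].mean(axis=0) for p in pathways])
    return pathways, profiles

def rank_discriminating_signatures(profiles):
    """Order signatures by the variance of their pathway means, largest first."""
    spread = profiles.var(axis=0)
    return np.argsort(spread)[::-1], spread

# Synthetic example: 4 pathways of 50 patients, 6 signatures.
# Only signature 2 is shifted by pathway, so it should rank first.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(4), 50)
deviations = rng.normal(0.0, 0.1, size=(200, 6))
deviations[:, 2] += labels * 1.0

pathways, profiles = pathway_signature_profiles(deviations, labels)
ranking, spread = rank_discriminating_signatures(profiles)
print("Most discriminating signature:", ranking[0])
```

The variance of pathway means is only one possible separation score; an F-statistic or effect size per signature would serve the same purpose.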

In [10]:
# ============================================================================
# LIST ALL GENERATED FILES
# ============================================================================

print("="*80)
print("COMPLETE FILE LISTING")
print("="*80)

if output_path.exists():
    print(f"\n📁 Output directory: {output_path}")
    print("\nGenerated files:")
    
    # Group files by extension and list each non-empty group
    all_files = sorted(output_path.glob('*'))
    for ext in ('pdf', 'png', 'csv', 'txt'):
        matches = [f for f in all_files if f.suffix == f'.{ext}']
        if matches:
            print(f"\n  {ext.upper()} files ({len(matches)}):")
            for f in matches:
                print(f"    - {f.name}")
else:
    print(f"\n⚠️  Output directory not found: {output_path}")
    print("   Run the analysis cell above first!")

print("\n💡 Key Insight:")
print("   This analysis demonstrates biological heterogeneity:")
print("   - Patient Heterogeneity: Different patients have different signature profiles")
print("   - Biological Heterogeneity: Same phenotype (MI) from different mechanisms")
print("   - Disease Heterogeneity: 'MI' is not a single entity")
================================================================================
COMPLETE FILE LISTING
================================================================================

📁 Output directory: complete_pathway_analysis_output/ukb_pathway_discovery

Generated files:

  PDF files (9):
    - comprehensive_pathway_comparison_myocardial_infarction.pdf
    - fh_carrier_prevalence_by_pathway_myocardial_infarction.pdf
    - pathway_size_and_age.pdf
    - prs_by_pathway.pdf
    - signature5_analysis_myocardial_infarction.pdf
    - signature_deviations_by_pathway.pdf
    - signature_deviations_myocardial_infarction_10yr_line.pdf
    - signature_deviations_myocardial_infarction_10yr_stacked.pdf
    - top_discriminating_signatures.pdf

  TXT files (2):
    - analysis_summary.txt
    - complete_analysis_log.txt

💡 Key Insight:
   This analysis demonstrates biological heterogeneity:
   - Patient Heterogeneity: Different patients have different signature profiles
   - Biological Heterogeneity: Same phenotype (MI) from different mechanisms
   - Disease Heterogeneity: 'MI' is not a single entity

Methodological Note¶

Important distinction from main paper:

  • This analysis: Uses deviation-from-reference clustering for pathway discovery

    • Clusters patients based on how their signature trajectories deviate from the population average
    • Removes age confounding by centering on the population reference
    • Useful for identifying distinct biological pathways
  • Main paper (trajectory_and_prs_cluster.R, line 238): Uses k-means clustering on time-averaged signature loadings

    • Clusters patients based on their average signature levels across time
    • Then visualizes deviations from reference over time
    • More interpretable for clinical applications

Both approaches demonstrate heterogeneity, but serve different purposes:

  • Deviation-based: Better for pathway discovery (identifying distinct mechanisms)
  • Time-averaged: Better for patient stratification (identifying risk groups)
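
The difference between the two approaches lies mainly in the features passed to the clustering step. The sketch below builds both feature sets from a synthetic loadings array and clusters each with a minimal k-means. The array layout (patients × signatures × timepoints), the use of the cohort mean as the reference, and all function names are assumptions for illustration; the actual pipeline in run_complete_pathway_analysis_deviation_only.py may differ.

```python
import numpy as np

def deviation_features(lambdas, reference=None):
    """Flatten each patient's deviation-from-reference trajectory.

    lambdas:   (n_patients, n_signatures, n_timepoints) signature loadings
    reference: (n_signatures, n_timepoints); defaults to the cohort mean (assumption)
    """
    if reference is None:
        reference = lambdas.mean(axis=0)
    deviations = lambdas - reference  # broadcasts over patients
    return deviations.reshape(len(lambdas), -1)

def time_averaged_features(lambdas):
    """Average each signature loading over time: (n_patients, n_signatures)."""
    return lambdas.mean(axis=2)

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Synthetic loadings: 200 patients, 5 signatures, 10 yearly timepoints
rng = np.random.default_rng(0)
lambdas = rng.normal(size=(200, 5, 10))
k = 4

dev_X = deviation_features(lambdas)      # deviation-based: keeps the time axis
avg_X = time_averaged_features(lambdas)  # time-averaged: collapses the time axis
dev_labels = kmeans(dev_X, k)
avg_labels = kmeans(avg_X, k)
print(dev_X.shape, avg_X.shape)
```

The deviation-based features retain when a signature departs from the reference, while the time-averaged features retain only its overall level, so the two clusterings need not agree.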

References¶

  • Main pathway analysis: heterogeneity_analysis_summary.ipynb
  • Pathway discovery code: run_complete_pathway_analysis_deviation_only.py
  • Related heterogeneity discussion: R3_Q8_Heterogeneity.ipynb
  • Main paper method: trajectory_and_prs_cluster.R (line 238)